[jira] [Resolved] (SPARK-18790) Keep a general offset history of stream batches
[ https://issues.apache.org/jira/browse/SPARK-18790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-18790. -- Resolution: Fixed Assignee: Tyson Condie Fix Version/s: 2.1.1 2.0.3 Target Version/s: (was: 2.1.0) > Keep a general offset history of stream batches > --- > > Key: SPARK-18790 > URL: https://issues.apache.org/jira/browse/SPARK-18790 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Tyson Condie >Assignee: Tyson Condie > Fix For: 2.0.3, 2.1.1 > > > Instead of only keeping the minimum number of offsets around, we should keep > enough information to allow us to roll back n batches and reexecute the > stream starting from a given point. In particular, we should create a config > in SQLConf, spark.sql.streaming.retainedBatches that defaults to 100 and > ensure that we keep enough log files in the following places to roll back the > specified number of batches: > the offsets that are present in each batch > versions of the state store > the files lists stored for the FileStreamSource > the metadata log stored by the FileStreamSink -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
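The retention policy described in SPARK-18790 can be sketched outside Spark. This is a minimal illustration only: the function name and the list-of-batch-ids "log" are hypothetical stand-ins, not the actual metadata log implementation.

```python
def purge_old_batches(batch_ids, retained_batches=100):
    """Keep only the newest `retained_batches` entries of a batch log.

    Illustrative sketch: retaining the last N batch ids is what allows a
    stream to roll back up to N batches and re-execute from that point.
    """
    if retained_batches <= 0:
        raise ValueError("retained_batches must be positive")
    return sorted(batch_ids)[-retained_batches:]
```

The same pruning rule would have to be applied consistently to each log the issue lists (offsets, state store versions, FileStreamSource file lists, FileStreamSink metadata); otherwise one log could lose history another still needs.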
[jira] [Assigned] (SPARK-18828) Refactor SparkR build and test scripts
[ https://issues.apache.org/jira/browse/SPARK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18828: Assignee: (was: Apache Spark) > Refactor SparkR build and test scripts > -- > > Key: SPARK-18828 > URL: https://issues.apache.org/jira/browse/SPARK-18828 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > > Since we are building the SparkR source package we are now seeing the call tree > getting more convoluted and more parts are getting duplicated. > We should try to clean this up. > One issue is the requirement to install SparkR before building the SparkR > source package (i.e. R CMD build) because of the loading of SparkR via > "library(SparkR)" in the vignettes. When we refactor that part of the > vignettes we should be able to further decouple the scripts.
[jira] [Assigned] (SPARK-18828) Refactor SparkR build and test scripts
[ https://issues.apache.org/jira/browse/SPARK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18828: Assignee: Apache Spark > Refactor SparkR build and test scripts > -- > > Key: SPARK-18828 > URL: https://issues.apache.org/jira/browse/SPARK-18828 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Apache Spark > > Since we are building the SparkR source package we are now seeing the call tree > getting more convoluted and more parts are getting duplicated. > We should try to clean this up. > One issue is the requirement to install SparkR before building the SparkR > source package (i.e. R CMD build) because of the loading of SparkR via > "library(SparkR)" in the vignettes. When we refactor that part of the > vignettes we should be able to further decouple the scripts.
[jira] [Commented] (SPARK-18828) Refactor SparkR build and test scripts
[ https://issues.apache.org/jira/browse/SPARK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15741200#comment-15741200 ] Apache Spark commented on SPARK-18828: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/16249 > Refactor SparkR build and test scripts > -- > > Key: SPARK-18828 > URL: https://issues.apache.org/jira/browse/SPARK-18828 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > > Since we are building the SparkR source package we are now seeing the call tree > getting more convoluted and more parts are getting duplicated. > We should try to clean this up. > One issue is the requirement to install SparkR before building the SparkR > source package (i.e. R CMD build) because of the loading of SparkR via > "library(SparkR)" in the vignettes. When we refactor that part of the > vignettes we should be able to further decouple the scripts.
[jira] [Created] (SPARK-18828) Refactor SparkR build and test scripts
Felix Cheung created SPARK-18828: Summary: Refactor SparkR build and test scripts Key: SPARK-18828 URL: https://issues.apache.org/jira/browse/SPARK-18828 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.0 Reporter: Felix Cheung Since we are building the SparkR source package we are now seeing the call tree getting more convoluted and more parts are getting duplicated. We should try to clean this up. One issue is the requirement to install SparkR before building the SparkR source package (i.e. R CMD build) because of the loading of SparkR via "library(SparkR)" in the vignettes. When we refactor that part of the vignettes we should be able to further decouple the scripts.
[jira] [Updated] (SPARK-18570) Consider supporting other R formula operators
[ https://issues.apache.org/jira/browse/SPARK-18570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18570: - Priority: Minor (was: Major) > Consider supporting other R formula operators > - > > Key: SPARK-18570 > URL: https://issues.apache.org/jira/browse/SPARK-18570 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Felix Cheung >Priority: Minor > > Such as > {code} > * > X*Y include these variables and the interactions between them > ^ > (X + Z + W)^3 include these variables and all interactions up to three-way > | > X | Z conditioning: include x given z > {code} > Others include %in% and ` (backtick) > https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html
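For readers unfamiliar with R formula semantics, the `*` operator above is shorthand for the main effects plus their interaction. A toy sketch of that expansion (the helper name is hypothetical; the colon notation for interactions follows R's convention, and the parser is deliberately minimal):

```python
def expand_star(term):
    """Expand an R-style 'X*Y' formula term into its main effects plus
    the interaction term 'X:Y'. Only the two-variable case is handled."""
    if "*" not in term:
        return [term.strip()]
    left, right = (t.strip() for t in term.split("*", 1))
    return [left, right, f"{left}:{right}"]
```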
[jira] [Updated] (SPARK-18569) Support R formula arithmetic
[ https://issues.apache.org/jira/browse/SPARK-18569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18569: - Affects Version/s: (was: 2.2.0) Target Version/s: 2.2.0 > Support R formula arithmetic > - > > Key: SPARK-18569 > URL: https://issues.apache.org/jira/browse/SPARK-18569 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Felix Cheung > > I think we should support arithmetic, which makes it a lot more convenient to > build models. Something like > {code} > log(y) ~ a + log(x) > {code} > And to avoid resolution confusion we should support the I() operator: > {code} > I > I(X*Z) as is: include a new variable consisting of these variables multiplied > {code} > Such that this works: > {code} > y ~ a + I(b+c) > {code} > the term b+c is to be interpreted as the sum of b and c.
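The I() semantics described above ("as is": evaluate the arithmetic before fitting) can be illustrated with a toy evaluator. The function name and its restriction to `+` between plain variable names are hypothetical, purely for illustration:

```python
def eval_identity_term(term, row):
    """Evaluate an R-style I() term such as 'I(b+c)' against a row dict.

    Toy sketch: strips the I( ) wrapper and sums the named columns, so
    'I(b+c)' yields one computed feature rather than two separate terms.
    """
    if not (term.startswith("I(") and term.endswith(")")):
        raise ValueError("expected a term of the form I(...)")
    inner = term[2:-1]
    return sum(row[name.strip()] for name in inner.split("+"))
```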
[jira] [Updated] (SPARK-18569) Support R formula arithmetic
[ https://issues.apache.org/jira/browse/SPARK-18569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18569: - Affects Version/s: 2.2.0 > Support R formula arithmetic > - > > Key: SPARK-18569 > URL: https://issues.apache.org/jira/browse/SPARK-18569 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Felix Cheung > > I think we should support arithmetic, which makes it a lot more convenient to > build models. Something like > {code} > log(y) ~ a + log(x) > {code} > And to avoid resolution confusion we should support the I() operator: > {code} > I > I(X*Z) as is: include a new variable consisting of these variables multiplied > {code} > Such that this works: > {code} > y ~ a + I(b+c) > {code} > the term b+c is to be interpreted as the sum of b and c.
[jira] [Updated] (SPARK-18570) Consider supporting other R formula operators
[ https://issues.apache.org/jira/browse/SPARK-18570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18570: - Target Version/s: 2.2.0 > Consider supporting other R formula operators > - > > Key: SPARK-18570 > URL: https://issues.apache.org/jira/browse/SPARK-18570 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Felix Cheung > > Such as > {code} > * > X*Y include these variables and the interactions between them > ^ > (X + Z + W)^3 include these variables and all interactions up to three-way > | > X | Z conditioning: include x given z > {code} > Others include %in% and ` (backtick) > https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html
[jira] [Updated] (SPARK-18348) Improve tree ensemble model summary
[ https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18348: - Target Version/s: 2.2.0 > Improve tree ensemble model summary > --- > > Key: SPARK-18348 > URL: https://issues.apache.org/jira/browse/SPARK-18348 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Affects Versions: 2.0.0, 2.1.0 >Reporter: Felix Cheung > > During work on R APIs for tree ensemble models (e.g. Random Forest, GBT) it was > discovered and discussed that > - we don't have a good summary on nodes or trees for their observations, > loss, probability and so on > - we don't have a shared API with nicely formatted output > We believe this could be a shared API that benefits multiple language > bindings, including R, when available. > For example, here is what R {code}rpart{code} shows for a model summary: > {code} > Call: > rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis, > method = "class") > n= 81 > CP nsplit rel error xerror xstd > 1 0.17647059 0 1.000 1.000 0.2155872 > 2 0.01960784 1 0.8235294 0.9411765 0.2107780 > 3 0.0100 4 0.7647059 1.0588235 0.2200975 > Variable importance > Start Age Number > 64 24 12 > Node number 1: 81 observations, complexity param=0.1764706 > predicted class=absent expected loss=0.2098765 P(node) =1 > class counts: 64 17 > probabilities: 0.790 0.210 > left son=2 (62 obs) right son=3 (19 obs) > Primary splits: > Start < 8.5 to the right, improve=6.762330, (0 missing) > Number < 5.5 to the left, improve=2.866795, (0 missing) > Age < 39.5 to the left, improve=2.250212, (0 missing) > Surrogate splits: > Number < 6.5 to the left, agree=0.802, adj=0.158, (0 split) > Node number 2: 62 observations, complexity param=0.01960784 > predicted class=absent expected loss=0.09677419 P(node) =0.7654321 > class counts: 56 6 > probabilities: 0.903 0.097 > left son=4 (29 obs) right son=5 (33 obs) > Primary splits: > Start < 14.5 to the right, improve=1.0205280, (0 missing) > Age < 55 to the left, improve=0.6848635, (0 missing) > Number < 4.5 to the left, improve=0.2975332, (0 missing) > Surrogate splits: > Number < 3.5 to the left, agree=0.645, adj=0.241, (0 split) > Age < 16 to the left, agree=0.597, adj=0.138, (0 split) > Node number 3: 19 observations > predicted class=present expected loss=0.4210526 P(node) =0.2345679 > class counts: 8 11 > probabilities: 0.421 0.579 > Node number 4: 29 observations > predicted class=absent expected loss=0 P(node) =0.3580247 > class counts: 29 0 > probabilities: 1.000 0.000 > Node number 5: 33 observations, complexity param=0.01960784 > predicted class=absent expected loss=0.1818182 P(node) =0.4074074 > class counts: 27 6 > probabilities: 0.818 0.182 > left son=10 (12 obs) right son=11 (21 obs) > Primary splits: > Age < 55 to the left, improve=1.2467530, (0 missing) > Start < 12.5 to the right, improve=0.2887701, (0 missing) > Number < 3.5 to the right, improve=0.1753247, (0 missing) > Surrogate splits: > Start < 9.5 to the left, agree=0.758, adj=0.333, (0 split) > Number < 5.5 to the right, agree=0.697, adj=0.167, (0 split) > Node number 10: 12 observations > predicted class=absent expected loss=0 P(node) =0.1481481 > class counts: 12 0 > probabilities: 1.000 0.000 > Node number 11: 21 observations, complexity param=0.01960784 > predicted class=absent expected loss=0.2857143 P(node) =0.2592593 > class counts: 15 6 > probabilities: 0.714 0.286 > left son=22 (14 obs) right son=23 (7 obs) > Primary splits: > Age < 111 to the right, improve=1.71428600, (0 missing) > Start < 12.5 to the right, improve=0.79365080, (0 missing) > Number < 3.5 to the right, improve=0.07142857, (0 missing) > Node number 22: 14 observations > predicted class=absent expected loss=0.1428571 P(node) =0.1728395 > class counts: 12 2 > probabilities: 0.857 0.143 > Node number 23: 7 observations > predicted class=present expected loss=0.4285714 P(node) =0.08641975 > class counts: 3 4 > probabilities: 0.429 0.571 > {code}
[jira] [Commented] (SPARK-10413) Model should support prediction on single instance
[ https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15741169#comment-15741169 ] Yanbo Liang commented on SPARK-10413: - [~anshbansal] Yeah, we will put this feature at high priority in the Spark 2.2 release cycle. I think there is no JIRA ticket for a predict method on the whole pipeline model; that work depends on this feature. Thanks. > Model should support prediction on single instance > -- > > Key: SPARK-10413 > URL: https://issues.apache.org/jira/browse/SPARK-10413 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > > Currently models in the pipeline API only implement transform(DataFrame). It > would be quite useful to support prediction on a single instance. > UPDATE: This issue is for making predictions with single models. We can make > methods like {{def predict(features: Vector): Double}} public. > * This issue is *not* for single-instance prediction for full Pipelines, > which would require making predictions on {{Row}}s.
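The distinction the issue draws (dataset-level transform versus single-instance predict) can be shown with a toy model. This is not Spark's API; the class and method shapes below are illustrative only:

```python
class ToyLinearModel:
    """Toy model contrasting dataset-level transform with the
    single-instance predict the issue wants exposed publicly."""

    def __init__(self, weights, bias=0.0):
        self.weights = weights
        self.bias = bias

    def predict(self, features):
        # Single-instance path: no DataFrame machinery involved.
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias

    def transform(self, dataset):
        # Dataset-level path, built on the single-instance method.
        return [(row, self.predict(row)) for row in dataset]
```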
[jira] [Updated] (SPARK-10413) Model should support prediction on single instance
[ https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10413: Labels: (was: 2.2.0) > Model should support prediction on single instance > -- > > Key: SPARK-10413 > URL: https://issues.apache.org/jira/browse/SPARK-10413 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > > Currently models in the pipeline API only implement transform(DataFrame). It > would be quite useful to support prediction on a single instance. > UPDATE: This issue is for making predictions with single models. We can make > methods like {{def predict(features: Vector): Double}} public. > * This issue is *not* for single-instance prediction for full Pipelines, > which would require making predictions on {{Row}}s.
[jira] [Updated] (SPARK-10413) Model should support prediction on single instance
[ https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10413: Labels: 2.2.0 (was: ) > Model should support prediction on single instance > -- > > Key: SPARK-10413 > URL: https://issues.apache.org/jira/browse/SPARK-10413 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > Labels: 2.2.0 > > Currently models in the pipeline API only implement transform(DataFrame). It > would be quite useful to support prediction on a single instance. > UPDATE: This issue is for making predictions with single models. We can make > methods like {{def predict(features: Vector): Double}} public. > * This issue is *not* for single-instance prediction for full Pipelines, > which would require making predictions on {{Row}}s.
[jira] [Updated] (SPARK-10884) Support prediction on single instance for regression and classification related models
[ https://issues.apache.org/jira/browse/SPARK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10884: Labels: 2.2.0 (was: ) > Support prediction on single instance for regression and classification > related models > -- > > Key: SPARK-10884 > URL: https://issues.apache.org/jira/browse/SPARK-10884 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Labels: 2.2.0 > > Support prediction on a single instance for regression and classification > related models (i.e., PredictionModel, ClassificationModel and their > subclasses). > Add corresponding test cases. > See parent issue for more details.
[jira] [Assigned] (SPARK-10884) Support prediction on single instance for regression and classification related models
[ https://issues.apache.org/jira/browse/SPARK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-10884: --- Assignee: Yanbo Liang > Support prediction on single instance for regression and classification > related models > -- > > Key: SPARK-10884 > URL: https://issues.apache.org/jira/browse/SPARK-10884 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Labels: 2.2.0 > > Support prediction on a single instance for regression and classification > related models (i.e., PredictionModel, ClassificationModel and their > subclasses). > Add corresponding test cases. > See parent issue for more details.
[jira] [Assigned] (SPARK-18827) Can't cache broadcast to disk
[ https://issues.apache.org/jira/browse/SPARK-18827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18827: Assignee: Apache Spark > Can't cache broadcast to disk > -- > > Key: SPARK-18827 > URL: https://issues.apache.org/jira/browse/SPARK-18827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1, 2.0.2, 2.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark > > How to reproduce it: > {code:java} > test("Cache broadcast to disk") { > val conf = new SparkConf() > .setAppName("Cache broadcast to disk") > .setMaster("local") > .set("spark.memory.useLegacyMode", "true") > .set("spark.storage.memoryFraction", "0.0") > sc = new SparkContext(conf) > val list = List[Int](1, 2, 3, 4) > val broadcast = sc.broadcast(list) > assert(broadcast.value.sum === 10) > } > {code} > It fails on Spark 2.0.1, 2.0.2 and 2.1.0.
[jira] [Assigned] (SPARK-18827) Can't cache broadcast to disk
[ https://issues.apache.org/jira/browse/SPARK-18827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18827: Assignee: (was: Apache Spark) > Can't cache broadcast to disk > -- > > Key: SPARK-18827 > URL: https://issues.apache.org/jira/browse/SPARK-18827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1, 2.0.2, 2.1.0 >Reporter: Yuming Wang > > How to reproduce it: > {code:java} > test("Cache broadcast to disk") { > val conf = new SparkConf() > .setAppName("Cache broadcast to disk") > .setMaster("local") > .set("spark.memory.useLegacyMode", "true") > .set("spark.storage.memoryFraction", "0.0") > sc = new SparkContext(conf) > val list = List[Int](1, 2, 3, 4) > val broadcast = sc.broadcast(list) > assert(broadcast.value.sum === 10) > } > {code} > It fails on Spark 2.0.1, 2.0.2 and 2.1.0.
[jira] [Commented] (SPARK-18827) Can't cache broadcast to disk
[ https://issues.apache.org/jira/browse/SPARK-18827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15741061#comment-15741061 ] Apache Spark commented on SPARK-18827: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/16252 > Can't cache broadcast to disk > -- > > Key: SPARK-18827 > URL: https://issues.apache.org/jira/browse/SPARK-18827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1, 2.0.2, 2.1.0 >Reporter: Yuming Wang > > How to reproduce it: > {code:java} > test("Cache broadcast to disk") { > val conf = new SparkConf() > .setAppName("Cache broadcast to disk") > .setMaster("local") > .set("spark.memory.useLegacyMode", "true") > .set("spark.storage.memoryFraction", "0.0") > sc = new SparkContext(conf) > val list = List[Int](1, 2, 3, 4) > val broadcast = sc.broadcast(list) > assert(broadcast.value.sum === 10) > } > {code} > It fails on Spark 2.0.1, 2.0.2 and 2.1.0.
[jira] [Assigned] (SPARK-18826) Make FileStream be able to start with most recent files
[ https://issues.apache.org/jira/browse/SPARK-18826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18826: Assignee: Shixiong Zhu (was: Apache Spark) > Make FileStream be able to start with most recent files > --- > > Key: SPARK-18826 > URL: https://issues.apache.org/jira/browse/SPARK-18826 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > When starting a stream with a lot of backfill and maxFilesPerTrigger, the > user will often want to start with the most recent files first. This would let > you keep low latency for recent data and slowly backfill historical data. > We should add an option to control this behavior.
[jira] [Assigned] (SPARK-18826) Make FileStream be able to start with most recent files
[ https://issues.apache.org/jira/browse/SPARK-18826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18826: Assignee: Apache Spark (was: Shixiong Zhu) > Make FileStream be able to start with most recent files > --- > > Key: SPARK-18826 > URL: https://issues.apache.org/jira/browse/SPARK-18826 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Apache Spark > > When starting a stream with a lot of backfill and maxFilesPerTrigger, the > user will often want to start with the most recent files first. This would let > you keep low latency for recent data and slowly backfill historical data. > We should add an option to control this behavior.
[jira] [Commented] (SPARK-18826) Make FileStream be able to start with most recent files
[ https://issues.apache.org/jira/browse/SPARK-18826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15741011#comment-15741011 ] Apache Spark commented on SPARK-18826: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16251 > Make FileStream be able to start with most recent files > --- > > Key: SPARK-18826 > URL: https://issues.apache.org/jira/browse/SPARK-18826 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > When starting a stream with a lot of backfill and maxFilesPerTrigger, the > user will often want to start with the most recent files first. This would let > you keep low latency for recent data and slowly backfill historical data. > We should add an option to control this behavior.
[jira] [Commented] (SPARK-18827) Can't cache broadcast to disk
[ https://issues.apache.org/jira/browse/SPARK-18827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15741009#comment-15741009 ] Yuming Wang commented on SPARK-18827: - I will create a PR later. > Can't cache broadcast to disk > -- > > Key: SPARK-18827 > URL: https://issues.apache.org/jira/browse/SPARK-18827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1, 2.0.2, 2.1.0 >Reporter: Yuming Wang > > How to reproduce it: > {code:java} > test("Cache broadcast to disk") { > val conf = new SparkConf() > .setAppName("Cache broadcast to disk") > .setMaster("local") > .set("spark.memory.useLegacyMode", "true") > .set("spark.storage.memoryFraction", "0.0") > sc = new SparkContext(conf) > val list = List[Int](1, 2, 3, 4) > val broadcast = sc.broadcast(list) > assert(broadcast.value.sum === 10) > } > {code} > It fails on Spark 2.0.1, 2.0.2 and 2.1.0.
[jira] [Created] (SPARK-18827) Can't cache broadcast to disk
Yuming Wang created SPARK-18827: --- Summary: Can't cache broadcast to disk Key: SPARK-18827 URL: https://issues.apache.org/jira/browse/SPARK-18827 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.2, 2.0.1, 2.1.0 Reporter: Yuming Wang How to reproduce it: {code:java} test("Cache broadcast to disk") { val conf = new SparkConf() .setAppName("Cache broadcast to disk") .setMaster("local") .set("spark.memory.useLegacyMode", "true") .set("spark.storage.memoryFraction", "0.0") sc = new SparkContext(conf) val list = List[Int](1, 2, 3, 4) val broadcast = sc.broadcast(list) assert(broadcast.value.sum === 10) } {code} It fails on Spark 2.0.1, 2.0.2 and 2.1.0.
[jira] [Created] (SPARK-18826) Make FileStream be able to start with most recent files
Shixiong Zhu created SPARK-18826: Summary: Make FileStream be able to start with most recent files Key: SPARK-18826 URL: https://issues.apache.org/jira/browse/SPARK-18826 Project: Spark Issue Type: Improvement Components: Structured Streaming Reporter: Shixiong Zhu Assignee: Shixiong Zhu When starting a stream with a lot of backfill and maxFilesPerTrigger, the user will often want to start with the most recent files first. This would let you keep low latency for recent data and slowly backfill historical data. We should add an option to control this behavior.
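The latest-first behavior proposed above can be sketched as an ordering rule over candidate files. The function and parameter names here are hypothetical, not the actual FileStreamSource option names:

```python
def pick_next_batch(files, max_files_per_trigger, latest_first=True):
    """Choose the next batch of files for a file stream.

    Sketch: order candidate files by modification time, newest first,
    so recent data stays low-latency while the historical backfill
    proceeds in later triggers. `files` is a list of (path, mod_time)
    pairs.
    """
    ordered = sorted(files, key=lambda f: f[1], reverse=latest_first)
    return [path for path, _ in ordered[:max_files_per_trigger]]
```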
[jira] [Updated] (SPARK-15572) MLlib in R format: compatibility with other languages
[ https://issues.apache.org/jira/browse/SPARK-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-15572: Shepherd: Yanbo Liang > MLlib in R format: compatibility with other languages > - > > Key: SPARK-15572 > URL: https://issues.apache.org/jira/browse/SPARK-15572 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Joseph K. Bradley > > Currently, models saved in R cannot be loaded easily into other languages. > This is because R saves extra metadata (feature names) alongside the model. > We should fix this issue so that models can be transferred seamlessly between > languages.
[jira] [Comment Edited] (SPARK-15572) MLlib in R format: compatibility with other languages
[ https://issues.apache.org/jira/browse/SPARK-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740967#comment-15740967 ] Yanbo Liang edited comment on SPARK-15572 at 12/12/16 4:50 AM: --- Sure, that's great. I updated myself as the shepherd. was (Author: yanboliang): Sure, that great. I updated me as the shepherd. > MLlib in R format: compatibility with other languages > - > > Key: SPARK-15572 > URL: https://issues.apache.org/jira/browse/SPARK-15572 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Joseph K. Bradley > > Currently, models saved in R cannot be loaded easily into other languages. > This is because R saves extra metadata (feature names) alongside the model. > We should fix this issue so that models can be transferred seamlessly between > languages.
[jira] [Commented] (SPARK-15572) MLlib in R format: compatibility with other languages
[ https://issues.apache.org/jira/browse/SPARK-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740967#comment-15740967 ] Yanbo Liang commented on SPARK-15572: - Sure, that great. I updated me as the shepherd. > MLlib in R format: compatibility with other languages > - > > Key: SPARK-15572 > URL: https://issues.apache.org/jira/browse/SPARK-15572 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Joseph K. Bradley > > Currently, models saved in R cannot be loaded easily into other languages. > This is because R saves extra metadata (feature names) alongside the model. > We should fix this issue so that models can be transferred seamlessly between > languages.
[jira] [Resolved] (SPARK-18325) SparkR 2.1 QA: Check for new R APIs requiring example code
[ https://issues.apache.org/jira/browse/SPARK-18325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-18325. - Resolution: Fixed Fix Version/s: 2.1.1 > SparkR 2.1 QA: Check for new R APIs requiring example code > -- > > Key: SPARK-18325 > URL: https://issues.apache.org/jira/browse/SPARK-18325 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > Fix For: 2.1.1 > > > Audit list of new features added to MLlib's R API, and see which major items > are missing example code (in the examples folder). We do not need examples > for everything, only for major items such as new algorithms. > For any such items: > * Create a JIRA for that feature, and assign it to the author of the feature > (or yourself if interested). > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18325) SparkR 2.1 QA: Check for new R APIs requiring example code
[ https://issues.apache.org/jira/browse/SPARK-18325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740943#comment-15740943 ] Yanbo Liang commented on SPARK-18325: - Since PR 16148 has been merged, I think we can resolve this task. PR 16214 is follow-up work which is not strongly required in this release (it may be merged after 2.1), so I will resolve this to avoid blocking the 2.1 release. Thanks. > SparkR 2.1 QA: Check for new R APIs requiring example code > -- > > Key: SPARK-18325 > URL: https://issues.apache.org/jira/browse/SPARK-18325 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Audit list of new features added to MLlib's R API, and see which major items > are missing example code (in the examples folder). We do not need examples > for everything, only for major items such as new algorithms. > For any such items: > * Create a JIRA for that feature, and assign it to the author of the feature > (or yourself if interested). > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740906#comment-15740906 ] caolan commented on SPARK-17147: I am using Spark 2.0.0 + Kafka 0.10 + compacted topics, including in some production environments, so this fix is really important. The question, then, is how importance is decided: compacted Kafka topics should be widely used by now, and Spark 2.0 should support them well. As for the other issue, it did not happen all the time and had no regular pattern: several times in one day, or not at all for several days. So I should enlarge spark.streaming.kafka.consumer.poll.ms, right? > Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets > (i.e. Log Compaction) > -- > > Key: SPARK-17147 > URL: https://issues.apache.org/jira/browse/SPARK-17147 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: Robert Conrad > > When Kafka does log compaction offsets often end up with gaps, meaning the > next requested offset will frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
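The workaround described in the report amounts to deriving the next expected offset from the record actually returned, rather than assuming offsets are consecutive. A minimal Python simulation of the two strategies (with a hypothetical Record type, not Spark's actual KafkaRDD or CachedKafkaConsumer classes):

```python
from collections import namedtuple

# Hypothetical stand-in for a Kafka consumer record.
Record = namedtuple("Record", ["offset", "value"])

# A compacted log: offsets 0..5 once existed, but compaction removed 1, 2, and 4.
compacted_log = [Record(0, "a"), Record(3, "b"), Record(5, "c")]

def read_assuming_consecutive(log, start_offset):
    """Buggy logic: assumes the next offset is always previous + 1."""
    next_offset, out = start_offset, []
    for record in log:
        if record.offset != next_offset:
            raise AssertionError(
                f"expected offset {next_offset}, got {record.offset}")
        out.append(record.value)
        next_offset += 1
    return out

def read_from_returned_offsets(log, start_offset):
    """Fixed logic: derive the next offset from the record that came back."""
    next_offset, out = start_offset, []
    for record in log:
        assert record.offset >= next_offset  # offsets only move forward
        out.append(record.value)
        next_offset = record.offset + 1      # i.e. nextOffset = record.offset + 1
    return out

print(read_from_returned_offsets(compacted_log, 0))  # ['a', 'b', 'c']
try:
    read_assuming_consecutive(compacted_log, 0)
except AssertionError as e:
    print("consecutive-offset assumption fails:", e)
```

The simulation shows why the one-line change tolerates compaction gaps: the reader trusts the broker's returned offsets instead of its own counter.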
[jira] [Assigned] (SPARK-18824) Add optimizer rule to reorder expensive Filter predicates like ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-18824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18824: Assignee: (was: Apache Spark) > Add optimizer rule to reorder expensive Filter predicates like ScalaUDF > --- > > Key: SPARK-18824 > URL: https://issues.apache.org/jira/browse/SPARK-18824 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > During evaluation of predicates in Filter, we can reorder the expressions in > order to evaluate the more expensive expressions like ScalaUDF later. So if > other expressions are evaluated to false, we can avoid evaluation of these > UDFs. > We can add an optimizer rule to do this optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18824) Add optimizer rule to reorder expensive Filter predicates like ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-18824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740903#comment-15740903 ] Apache Spark commented on SPARK-18824: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/16245 > Add optimizer rule to reorder expensive Filter predicates like ScalaUDF > --- > > Key: SPARK-18824 > URL: https://issues.apache.org/jira/browse/SPARK-18824 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > During evaluation of predicates in Filter, we can reorder the expressions in > order to evaluate the more expensive expressions like ScalaUDF later. So if > other expressions are evaluated to false, we can avoid evaluation of these > UDFs. > We can add an optimizer rule to do this optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18824) Add optimizer rule to reorder expensive Filter predicates like ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-18824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18824: Assignee: Apache Spark > Add optimizer rule to reorder expensive Filter predicates like ScalaUDF > --- > > Key: SPARK-18824 > URL: https://issues.apache.org/jira/browse/SPARK-18824 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > During evaluation of predicates in Filter, we can reorder the expressions in > order to evaluate the more expensive expressions like ScalaUDF later. So if > other expressions are evaluated to false, we can avoid evaluation of these > UDFs. > We can add an optimizer rule to do this optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18332) SparkR 2.1 QA: Programming guide, migration guide, vignettes updates
[ https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740557#comment-15740557 ] Joseph K. Bradley edited comment on SPARK-18332 at 12/12/16 4:05 AM: - Let's do it after the 2.1 release. We can always update the docs post-hoc. I made a JIRA for it: [SPARK-18825] was (Author: josephkb): Let's do it after the 2.1 release. We can always update the docs post-hoc. I'll make a JIRA for it. > SparkR 2.1 QA: Programming guide, migration guide, vignettes updates > > > Key: SPARK-18332 > URL: https://issues.apache.org/jira/browse/SPARK-18332 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Before the release, we need to update the SparkR Programming Guide, its > migration guide, and the R vignettes. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > * Update R vignettes > Note: This task is for large changes to the guides. New features are handled > in [SPARK-18330]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18825) Eliminate duplicate links in SparkR API doc index
Joseph K. Bradley created SPARK-18825: - Summary: Eliminate duplicate links in SparkR API doc index Key: SPARK-18825 URL: https://issues.apache.org/jira/browse/SPARK-18825 Project: Spark Issue Type: Documentation Components: Documentation, SparkR Reporter: Joseph K. Bradley The SparkR API docs contain many duplicate links with suffixes {{-method}} or {{-class}} in the index . E.g., {{atan}} and {{atan-method}} link to the same doc. Copying from [~felixcheung] in [SPARK-18332]: {quote} They are because of the {{@ aliases}} tags. I think we are adding them because CRAN checks require them to match the specific format - [~shivaram] would you know? I am pretty sure they are double-listed because in addition to aliases we also have {{@ rdname}} which automatically generate the links as well. I suspect if we change all the rdname to match the string in aliases then there will be one link. I can take a shot at this to test this out, but changes will be very extensive - is this something we could get into 2.1 still? {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18824) Add optimizer rule to reorder expensive Filter predicates like ScalaUDF
Liang-Chi Hsieh created SPARK-18824: --- Summary: Add optimizer rule to reorder expensive Filter predicates like ScalaUDF Key: SPARK-18824 URL: https://issues.apache.org/jira/browse/SPARK-18824 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh During evaluation of predicates in Filter, we can reorder the expressions in order to evaluate the more expensive expressions like ScalaUDF later. So if other expressions are evaluated to false, we can avoid evaluation of these UDFs. We can add an optimizer rule to do this optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
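The proposed rule can be illustrated outside Spark: order a conjunction of filter predicates so that cheap ones run first, and expensive ones (stand-ins for Scala UDFs, which Catalyst cannot inspect) are only evaluated when everything cheaper passed. This is an illustrative Python sketch with made-up cost numbers, not the actual Catalyst rule:

```python
# Track how often each predicate is evaluated.
calls = []

def cheap_pred(row):
    calls.append("cheap")
    return row["x"] > 0

def expensive_udf(row):
    calls.append("expensive")
    # Simulate an opaque, costly user-defined function.
    return sum(range(1000)) % 2 == 0 and row["name"].startswith("a")

# (estimated cost, predicate) pairs for one Filter's conjunction.
predicates = [(100, expensive_udf), (1, cheap_pred)]

def apply_filter(rows, predicates):
    # The "optimizer rule": sort predicates by estimated cost, cheapest first.
    ordered = [p for _, p in sorted(predicates, key=lambda cp: cp[0])]
    # all() short-circuits, so expensive predicates are skipped whenever a
    # cheaper one already returned False.
    return [row for row in rows if all(p(row) for p in ordered)]

rows = [{"x": -1, "name": "abc"}, {"x": 2, "name": "abc"}]
result = apply_filter(rows, predicates)
# The expensive predicate ran only for the row that passed the cheap one.
print([r["x"] for r in result], calls.count("expensive"))  # [2] 1
```

Without the reordering, the expensive predicate would run for every row; with it, the row failing the cheap predicate never reaches the UDF.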
[jira] [Commented] (SPARK-16073) Performance of Parquet encodings on saving primitive arrays
[ https://issues.apache.org/jira/browse/SPARK-16073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740807#comment-15740807 ] Kazuaki Ishizaki commented on SPARK-16073: -- This is an interesting topic. In the current situation, SPARK-16043 will not be merged soon, because performance issues for DataFrame/Dataset programs with primitive arrays are being addressed by other approaches. If there are benchmark programs for this measurement, I am happy to run them with SPARK-16043. > Performance of Parquet encodings on saving primitive arrays > --- > > Key: SPARK-16073 > URL: https://issues.apache.org/jira/browse/SPARK-16073 > Project: Spark > Issue Type: Task > Components: MLlib, SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > Spark supports both uncompressed and compressed (snappy, gzip, lzo) Parquet > data. However, Parquet also has its own encodings to compress columns/arrays, > e.g., dictionary encoding: > https://github.com/apache/parquet-format/blob/master/Encodings.md. > It might be worth checking the performance overhead of Parquet encodings on > saving large primitive arrays, which is a machine learning use case. If the > overhead is significant, we should expose a configuration in Spark to control > the encoding levels. > Note that this shouldn't be tested under Spark until SPARK-16043 was fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18806) driverwrapper and executor doesn't exit when worker killed
[ https://issues.apache.org/jira/browse/SPARK-18806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740792#comment-15740792 ] liujianhui commented on SPARK-18806: No, it's a problem: sometimes two copies of the same driver exist! And a coarse-grained executor in a zombie state keeps its memory reserved even after the worker has exited. > driverwrapper and executor doesn't exit when worker killed > -- > > Key: SPARK-18806 > URL: https://issues.apache.org/jira/browse/SPARK-18806 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.1 > Environment: java1.8 >Reporter: liujianhui > > Submit an application in standalone-cluster mode; the master will then > launch an executor and a driverwrapper on a worker. Both start a WorkerWatcher > to watch the worker. As a result, when the worker is killed manually, the > driverwrapper and executor sometimes do not exit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18820) Driver may send "LaunchTask" before executor receive "RegisteredExecutor"
[ https://issues.apache.org/jira/browse/SPARK-18820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740754#comment-15740754 ] jin xing commented on SPARK-18820: -- [~lins05] Thanks a lot for your comment : ) In our company's cluster, we see many of the NullPointerExceptions described above. Checking the source code, I found that CoarseGrainedSchedulerBackend updates executorDataMap first and then replies with "RegisteredExecutor". After executorDataMap is updated, the newly joined executor may be sent "LaunchTask", which can result in "LaunchTask" arriving before "RegisteredExecutor". What do you think about this? > Driver may send "LaunchTask" before executor receive "RegisteredExecutor" > - > > Key: SPARK-18820 > URL: https://issues.apache.org/jira/browse/SPARK-18820 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.3 > Environment: spark-1.6.3 >Reporter: jin xing > > CoarseGrainedSchedulerBackend will update executorDataMap after receiving > "RegisterExecutor", thus the task scheduler may assign tasks to this executor; > If LaunchTask arrives at CoarseGrainedExecutorBackend before > RegisteredExecutor, it will result in a NullPointerException and the executor > backend will exit; > Is it a bug? If so, can I make a PR? I think the driver should send "LaunchTask" > only after "RegisteredExecutor" has been received. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
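The ordering described in the comment can be reproduced with a toy message-queue model (plain Python, not Spark's actual RPC layer; the names mirror the discussion but are assumptions for illustration). If the driver registers the executor in its map before sending the RegisteredExecutor reply, a task scheduled in between is enqueued first:

```python
from collections import deque

# Messages flowing from driver to one executor, in send order.
executor_inbox = deque()

# Driver-side view of registered executors.
executor_data_map = {}

def schedule_tasks():
    # Any executor present in the map is eligible to receive tasks.
    for _executor_id in executor_data_map:
        executor_inbox.append("LaunchTask")

def driver_handle_register(executor_id):
    # Problematic order: update the map first...
    executor_data_map[executor_id] = {"free_cores": 1}
    # ...the scheduler may now pick this executor and send LaunchTask...
    schedule_tasks()
    # ...and only then is the RegisteredExecutor reply sent.
    executor_inbox.append("RegisteredExecutor")

driver_handle_register("exec-1")
print(list(executor_inbox))  # ['LaunchTask', 'RegisteredExecutor']
```

In this model the executor sees "LaunchTask" before it knows it is registered, which is exactly the window in which the reported NullPointerException can occur; registering in the map only after the reply is sent closes the window.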
[jira] [Created] (SPARK-18823) Assignation by column name variable not available or bug?
Vicente Masip created SPARK-18823: - Summary: Assignation by column name variable not available or bug? Key: SPARK-18823 URL: https://issues.apache.org/jira/browse/SPARK-18823 Project: Spark Issue Type: Question Components: SparkR Affects Versions: 2.0.2 Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr 4. Or databricks (community.cloud.databricks.com). Reporter: Vicente Masip Fix For: 2.0.2 I really don't know if this is a bug or whether it can be done with some function. Sometimes it is very important to assign something to a column whose name has to be accessed through a variable. Outside of SparkR, I have always done this with double brackets, like this: # df could be the faithful data set as a normal data frame or data table. # accessing by variable name: myname = "waiting" df[[myname]] <- c(1:nrow(df)) # or even by column number df[[2]] <- df$eruptions The error is not caused by the right-hand side of the "<-" assignment operator. The problem is that I can't assign to a column name using a variable or column number as I do in these examples outside of Spark. It doesn't matter whether I am modifying or creating a column; same problem. I have also tried this, with no results: val df2 = withColumn(df,"tmp", df$eruptions) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18332) SparkR 2.1 QA: Programming guide, migration guide, vignettes updates
[ https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740557#comment-15740557 ] Joseph K. Bradley commented on SPARK-18332: --- Let's do it after the 2.1 release. We can always update the docs post-hoc. I'll make a JIRA for it. > SparkR 2.1 QA: Programming guide, migration guide, vignettes updates > > > Key: SPARK-18332 > URL: https://issues.apache.org/jira/browse/SPARK-18332 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Before the release, we need to update the SparkR Programming Guide, its > migration guide, and the R vignettes. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > * Update R vignettes > Note: This task is for large changes to the guides. New features are handled > in [SPARK-18330]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740419#comment-15740419 ] Nicholas Chammas commented on SPARK-13587: -- Thanks to a lot of help from [~quasi...@gmail.com] and [his blog post on this problem|http://quasiben.github.io/blog/2016/4/15/conda-spark/], I was able to develop a solution that works for Spark on YARN: {code} set -e # Both these directories exist on all of our YARN nodes. # Otherwise, everything else is built and shipped out at submit-time # with our application. export HADOOP_CONF_DIR="/etc/hadoop/conf" export SPARK_HOME="/hadoop/spark/spark-2.0.2-bin-hadoop2.6" export PATH="$SPARK_HOME/bin:$PATH" python3 -m venv venv/ source venv/bin/activate pip install -U pip pip install -r requirements.pip pip install -r requirements-dev.pip deactivate # This convoluted zip machinery is to ensure that the paths to the files inside the zip # look the same to Python when it runs within YARN. # If there is a simpler way to express this, I'd be interested to know! pushd venv/ zip -rq ../venv.zip * popd pushd myproject/ zip -rq ../myproject.zip * popd pushd tests/ zip -rq ../tests.zip * popd export PYSPARK_PYTHON="venv/bin/python" spark-submit \ --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=venv/bin/python" \ --conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" \ --master yarn \ --deploy-mode client \ --archives "venv.zip#venv,myproject.zip#myproject,tests.zip#tests" \ run_tests.py -v {code} My solution is based off of Ben's, except where Ben uses Conda I just use pip. I don't know if there is a way to adapt this solution to work with Spark on Mesos or Spark Standalone (and I haven't tried since my environment is YARN), but if someone figures it out please post your solution here! 
As Ben explains in [his blog post|http://quasiben.github.io/blog/2016/4/15/conda-spark/], this lets you build and ship an isolated environment with your PySpark application out to the YARN cluster. The YARN nodes don't even need to have the correct version of Python (or Python at all!) installed, because you are shipping out a complete Python environment via the {{--archives}} option. I hope this helps some people who are looking for a workaround they can use today while a more robust solution is developed directly into Spark. And I wonder... if this {{--archives}} technique can be extended or translated to Mesos and Standalone somehow, maybe that would be a good enough solution for the time being? People would be able to run their jobs in an isolated Python environment using their tool of choice (conda or pip), and Spark wouldn't need to add any virtualenv-specific machinery. > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for user to add third party python packages in > pyspark. > * One way is to using --py-files (suitable for simple dependency, but not > suitable for complicated dependency, especially with transitive dependency) > * Another way is install packages manually on each node (time wasting, and > not easy to switch to different environment) > Python has now 2 different virtualenv implementation. One is native > virtualenv another is through conda. This jira is trying to migrate these 2 > tools to distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740216#comment-15740216 ] Felix Cheung commented on SPARK-18813: -- This is great, Joseph. Thanks for putting down the framework on this. > MLlib 2.2 Roadmap > - > > Key: SPARK-18813 > URL: https://issues.apache.org/jira/browse/SPARK-18813 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > > *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.* > The roadmap process described below is significantly updated since the 2.1 > roadmap [SPARK-15581]. Please refer to [SPARK-15581] for more discussion on > the basis for this proposal, and comment in this JIRA if you have suggestions > for improvements. > h1. Roadmap process > This roadmap is a master list for MLlib improvements we are working on during > this release. This includes ML-related changes in PySpark and SparkR. > *What is planned for the next release?* > * This roadmap lists issues which at least one Committer has prioritized. > See details below in "Instructions for committers." > * This roadmap only lists larger or more critical issues. > *How can contributors influence this roadmap?* > * If you believe an issue should be in this roadmap, please discuss the issue > on JIRA and/or the dev mailing list. Make sure to ping Committers since at > least one must agree to shepherd the issue. > * For general discussions, use this JIRA or the dev mailing list. For > specific issues, please comment on those issues or the mailing list. > h2. Target Version and Priority > This section describes the meaning of Target Version and Priority. _These > meanings have been updated in this proposal for the 2.2 process._ > || Category | Target Version | Priority | Shepherd | Put on roadmap? | In > next release? 
|| > | 1 | next release | Blocker | *must* | *must* | *must* | > | 2 | next release | Critical | *must* | yes, unless small | *best effort* | > | 3 | next release | Major | *must* | optional | *best effort* | > | 4 | next release | Minor | optional | no | maybe | > | 5 | next release | Trivial | optional | no | maybe | > | 6 | (empty) | (any) | yes | no | maybe | > | 7 | (empty) | (any) | no | no | maybe | > The *Category* in the table above has the following meaning: > 1. A committer has promised to see this issue to completion for the next > release. Contributions *will* receive attention. > 2-3. A committer has promised to see this issue to completion for the next > release. Contributions *will* receive attention. The issue may slip to the > next release if development is slower than expected. > 4-5. A committer has promised interest in this issue. Contributions *will* > receive attention. The issue may slip to another release. > 6. A committer has promised interest in this issue and should respond, but no > promises are made about priorities or releases. > 7. This issue is open for discussion, but it needs a committer to promise > interest to proceed. > h1. Instructions > h2. For contributors > Getting started > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time contributor, please always start with a small > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a larger feature. > Coordinating on JIRA > * Never work silently. Let everyone know on the corresponding JIRA page when > you start work. This is to avoid duplicate work. For small patches, you do > not need to get the JIRA assigned to you to begin work. > * For medium/large features or features with dependencies, please get > assigned first before coding and keep the ETA updated on the JIRA. 
If there > is no activity on the JIRA page for a certain amount of time, the JIRA should > be released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Do not set these fields: Target Version, Fix Version, or Shepherd. Only > Committers should set those. > Writing and reviewing PRs > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * *Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours.* > h2. For Committers > Adding to this roadmap > * You can update the roadmap by (a) adding issues to this list and (b) > setting Target Versions. Only Committers may make these changes. > * *If you add an issue to this roadmap or set a Target Version, you _must_ > assign yourself or another Committer as Shepherd.* > * This list should be actively mana
[jira] [Updated] (SPARK-18821) Bisecting k-means wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18821: - Shepherd: Felix Cheung > Bisecting k-means wrapper in SparkR > --- > > Key: SPARK-18821 > URL: https://issues.apache.org/jira/browse/SPARK-18821 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Felix Cheung > > Implement a wrapper in SparkR to support bisecting k-means -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18822) Support ML Pipeline in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18822: - Shepherd: Felix Cheung > Support ML Pipeline in SparkR > - > > Key: SPARK-18822 > URL: https://issues.apache.org/jira/browse/SPARK-18822 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Felix Cheung > > From Joseph Bradley: > " > Supporting Pipelines and advanced use cases: There really needs to be more > design discussion around SparkR. Felix Cheung would you be interested in > leading some discussion? I'm envisioning something similar to what was done a > while back for Pipelines in Scala/Java/Python, where we consider several use > cases of MLlib: fitting a single model, creating and tuning a complex > Pipeline, and working with multiple languages. That should help inform what > APIs should look like in Spark R. > " > Certain ML model, such as OneVsRest, is harder to represent in a single call > R API. Having advanced API or Pipeline API like this could help to expose > that to our users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15767) Decision Tree Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-15767: - Shepherd: Felix Cheung > Decision Tree Regression wrapper in SparkR > -- > > Key: SPARK-15767 > URL: https://issues.apache.org/jira/browse/SPARK-15767 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Kai Jiang >Assignee: Kai Jiang > > Implement a wrapper in SparkR to support decision tree regression. R's naive > Decision Tree Regression implementation is from package rpart with signature > rpart(formula, dataframe, method="anova"). I propose we could implement API > like spark.rpart(dataframe, formula, ...) . After having implemented > decision tree classification, we could refactor this two into an API more > like rpart() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740208#comment-15740208 ] Michael Kamprath commented on SPARK-18819: -- One more note, this issue only arises when doubles are in the Parquet file. This code runs just fine in the ARM71 environment: {code} from pyspark.sql.types import * rdd2 = sc.parallelize([('row3',1,5,'name'),('row4',2,6,'string')]) my_schema2 = StructType([ StructField("id", StringType(), True), StructField("value1", IntegerType(), True), StructField("value2", IntegerType(), True), StructField("name",StringType(), True) ]) df2 = spark.createDataFrame( rdd2, schema=my_schema2) df2.coalesce(1).write.parquet('hdfs://master:9000/user/michael/test_data2',mode='overwrite') newdf2 = spark.read.parquet('hdfs://master:9000/user/michael/test_data2/') newdf2.take(1) {code} ARM71 requires doubles to be 8-byte aligned. So this is the first time I am digging into the Spark code ... is [SPARK-16962|https://github.com/apache/spark/pull/14762] a similar issue? I see that issue didn't address double alignment. > Failure to read single-row Parquet files > > > Key: SPARK-18819 > URL: https://issues.apache.org/jira/browse/SPARK-18819 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.0.2 > Environment: Ubuntu 14.04 LTS on ARM 7.1 >Reporter: Michael Kamprath >Priority: Critical > > When I create a data frame in PySpark with a small row count (less than > number executors), then write it to a parquet file, then load that parquet > file into a new data frame, and finally do any sort of read against the > loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. 
> Example code to replicate this issue:
> {code}
> from pyspark.sql.types import *
> rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')])
> my_schema = StructType([
>     StructField("id", StringType(), True),
>     StructField("value1", IntegerType(), True),
>     StructField("value2", DoubleType(), True),
>     StructField("name", StringType(), True)
> ])
> df = spark.createDataFrame(rdd, schema=my_schema)
> df.write.parquet('hdfs://master:9000/user/michael/test_data', mode='overwrite')
> newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/')
> newdf.take(1)
> {code}
> The error I get when the {{take}} step runs is:
> {code}
> ---
> Py4JJavaError Traceback (most recent call last)
> in ()
> 1 newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/')
> 2 newdf.take(1)
> /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num)
> 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
> 347 """
> --> 348 return self.limit(num).collect()
> 349
> 350 @since(1.3)
> /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self)
> 308 """
> 309 with SCCallSiteSync(self._sc) as css:
> --> 310 port = self._jdf.collectToPython()
> 311 return list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
> 312
> /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in __call__(self, *args)
> 1131 answer = self.gateway_client.send_command(command)
> 1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
> 1134
> 1135 for temp_arg in temp_args:
> /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
> 61 def deco(*a, **kw):
> 62 try:
> ---> 63 return f(*a, **kw)
> 64 except py4j.protocol.Py4JJavaError as e:
> 65 s = e.java_exception.toString()
> /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o54.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
> Driver stacktrace:
> at org.apache
[jira] [Updated] (SPARK-18822) Support ML Pipeline in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18822: - Description: >From Joseph Bradley: " Supporting Pipelines and advanced use cases: There really needs to be more design discussion around SparkR. Felix Cheung would you be interested in leading some discussion? I'm envisioning something similar to what was done a while back for Pipelines in Scala/Java/Python, where we consider several use cases of MLlib: fitting a single model, creating and tuning a complex Pipeline, and working with multiple languages. That should help inform what APIs should look like in Spark R. " Certain ML model, such as OneVsRest, is harder to represent in a single call R API. Having advanced API or Pipeline API like this could help to expose that to our users. was: >From Joseph Bradley: " Supporting Pipelines and advanced use cases: There really needs to be more design discussion around SparkR. Felix Cheung would you be interested in leading some discussion? I'm envisioning something similar to what was done a while back for Pipelines in Scala/Java/Python, where we consider several use cases of MLlib: fitting a single model, creating and tuning a complex Pipeline, and working with multiple languages. That should help inform what APIs should look like in Spark R. " Certain ML model, such as OneVsRest, is harder to represent in a single call R API. Having advanced API or Pipeline API like this could help to expose that to our users > Support ML Pipeline in SparkR > - > > Key: SPARK-18822 > URL: https://issues.apache.org/jira/browse/SPARK-18822 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Felix Cheung > > From Joseph Bradley: > " > Supporting Pipelines and advanced use cases: There really needs to be more > design discussion around SparkR. Felix Cheung would you be interested in > leading some discussion? 
I'm envisioning something similar to what was done a > while back for Pipelines in Scala/Java/Python, where we consider several use > cases of MLlib: fitting a single model, creating and tuning a complex > Pipeline, and working with multiple languages. That should help inform what > APIs should look like in Spark R. > " > Certain ML model, such as OneVsRest, is harder to represent in a single call > R API. Having advanced API or Pipeline API like this could help to expose > that to our users.
[jira] [Updated] (SPARK-18822) Support ML Pipeline in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18822: - Description: >From Joseph Bradley: " Supporting Pipelines and advanced use cases: There really needs to be more design discussion around SparkR. Felix Cheung would you be interested in leading some discussion? I'm envisioning something similar to what was done a while back for Pipelines in Scala/Java/Python, where we consider several use cases of MLlib: fitting a single model, creating and tuning a complex Pipeline, and working with multiple languages. That should help inform what APIs should look like in Spark R. " Certain ML model, such as OneVsRest, is harder to represent in a single call R API. Having advanced API or Pipeline API like this could help to expose that to our users was: >From Joseph Bradley: " Supporting Pipelines and advanced use cases: There really needs to be more design discussion around SparkR. Felix Cheung would you be interested in leading some discussion? I'm envisioning something similar to what was done a while back for Pipelines in Scala/Java/Python, where we consider several use cases of MLlib: fitting a single model, creating and tuning a complex Pipeline, and working with multiple languages. That should help inform what APIs should look like in Spark R. " > Support ML Pipeline in SparkR > - > > Key: SPARK-18822 > URL: https://issues.apache.org/jira/browse/SPARK-18822 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Felix Cheung > > From Joseph Bradley: > " > Supporting Pipelines and advanced use cases: There really needs to be more > design discussion around SparkR. Felix Cheung would you be interested in > leading some discussion? 
I'm envisioning something similar to what was done a > while back for Pipelines in Scala/Java/Python, where we consider several use > cases of MLlib: fitting a single model, creating and tuning a complex > Pipeline, and working with multiple languages. That should help inform what > APIs should look like in Spark R. > " > Certain ML model, such as OneVsRest, is harder to represent in a single call > R API. Having advanced API or Pipeline API like this could help to expose > that to our users
[jira] [Comment Edited] (SPARK-18813) MLlib 2.2 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740184#comment-15740184 ] Felix Cheung edited comment on SPARK-18813 at 12/11/16 7:11 PM: I added a couple of JIRAs for R that can be found with [this query|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0%20ORDER%20BY%20priority%20DESC] We could turn them into subtasks if we are having umbrella was (Author: felixcheung): I added a couple of JIRAs for R that can be found with [this query|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0%20ORDER%20BY%20priority%20DESC] > MLlib 2.2 Roadmap > - > > Key: SPARK-18813 > URL: https://issues.apache.org/jira/browse/SPARK-18813 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > > *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.* > The roadmap process described below is significantly updated since the 2.1 > roadmap [SPARK-15581]. Please refer to [SPARK-15581] for more discussion on > the basis for this proposal, and comment in this JIRA if you have suggestions > for improvements. > h1. Roadmap process > This roadmap is a master list for MLlib improvements we are working on during > this release. This includes ML-related changes in PySpark and SparkR. > *What is planned for the next release?* > * This roadmap lists issues which at least one Committer has prioritized. > See details below in "Instructions for committers." 
> * This roadmap only lists larger or more critical issues. > *How can contributors influence this roadmap?* > * If you believe an issue should be in this roadmap, please discuss the issue > on JIRA and/or the dev mailing list. Make sure to ping Committers since at > least one must agree to shepherd the issue. > * For general discussions, use this JIRA or the dev mailing list. For > specific issues, please comment on those issues or the mailing list. > h2. Target Version and Priority > This section describes the meaning of Target Version and Priority. _These > meanings have been updated in this proposal for the 2.2 process._ > || Category | Target Version | Priority | Shepherd | Put on roadmap? | In > next release? || > | 1 | next release | Blocker | *must* | *must* | *must* | > | 2 | next release | Critical | *must* | yes, unless small | *best effort* | > | 3 | next release | Major | *must* | optional | *best effort* | > | 4 | next release | Minor | optional | no | maybe | > | 5 | next release | Trivial | optional | no | maybe | > | 6 | (empty) | (any) | yes | no | maybe | > | 7 | (empty) | (any) | no | no | maybe | > The *Category* in the table above has the following meaning: > 1. A committer has promised to see this issue to completion for the next > release. Contributions *will* receive attention. > 2-3. A committer has promised to see this issue to completion for the next > release. Contributions *will* receive attention. The issue may slip to the > next release if development is slower than expected. > 4-5. A committer has promised interest in this issue. Contributions *will* > receive attention. The issue may slip to another release. > 6. A committer has promised interest in this issue and should respond, but no > promises are made about priorities or releases. > 7. This issue is open for discussion, but it needs a committer to promise > interest to proceed. > h1. Instructions > h2. 
For contributors > Getting started > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time contributor, please always start with a small > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a larger feature. > Coordinating on JIRA > * Never work silently. Let everyone know on the corresponding JIRA page when > you start work. This is to avoid duplicate work. For small patches, you do > not need to get the JIRA assigned to you to begin work. > * For medium/large features or features with dependencies, please get > assigned first before coding and keep the ETA updated on the JIRA. If there > is no activity on the JIRA page for a certain amount of time, the JIRA should > be released for other contributors. > * Do not claim multip
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740185#comment-15740185 ] Felix Cheung commented on SPARK-15581: -- re: Pipeline in R - certainly. opened https://issues.apache.org/jira/browse/SPARK-18822 to track. > MLlib 2.1 Roadmap > - > > Key: SPARK-15581 > URL: https://issues.apache.org/jira/browse/SPARK-15581 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > Fix For: 2.1.0 > > > This is a master list for MLlib improvements we are working on for the next > release. Please view this as a wish list rather than a definite plan, for we > don't have an accurate estimate of available resources. Due to limited review > bandwidth, features appearing on this list will get higher priority during > code review. But feel free to suggest new items to the list in comments. We > are experimenting with this process. Your feedback would be greatly > appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. Based on our experience, mixing the development > process with a big feature usually causes long delay in code review. > * Never work silently. Let everyone know on the corresponding JIRA page when > you start working on some features. This is to avoid duplicate work. For > small features, you don't need to wait to get JIRA assigned. > * For medium/big features or features with dependencies, please get assigned > first before coding and keep the ETA updated on the JIRA. 
If there exist no > activity on the JIRA page for a certain amount of time, the JIRA should be > released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours. > h2. For committers: > * Try to break down big features into small and specific JIRA tasks and link > them properly. > * Add a "starter" label to starter tasks. > * Put a rough estimate for medium/big features and track the progress. > * If you start reviewing a PR, please add yourself to the Shepherd field on > JIRA. > * If the code looks good to you, please comment "LGTM". For non-trivial PRs, > please ping a maintainer to make a final pass. > * After merging a PR, create and link JIRAs for Python, example code, and > documentation if applicable. > h1. Roadmap (*WIP*) > This is NOT [a complete list of MLlib JIRAs for 2.1| > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. > We only include umbrella JIRAs and high-level tasks. > Major efforts in this release: > * Feature parity for the DataFrames-based API (`spark.ml`), relative to the > RDD-based API > * ML persistence > * Python API feature parity and test coverage > * R API expansion and improvements > * Note about new features: As usual, we expect to expand the feature set of > MLlib. However, we will prioritize API parity, bug fixes, and improvements > over new features. > Note `spark.mllib` is in maintenance mode now. 
We will accept bug fixes for > it, but new features, APIs, and improvements will only be added to `spark.ml`. > h2. Critical feature parity in DataFrame-based API > * Umbrella JIRA: [SPARK-4591] > h2. Persistence > * Complete persistence within MLlib > ** Python tuning (SPARK-13786) > * MLlib in R format: compatibility with other languages (SPARK-15572) > * Impose backwards compatibility for persistence (SPARK-15573) > h2. Python API > * Standardize unit tests for Scala and Python to improve and consolidate test > coverage for Params, persistence, and other common functionality (SPARK-15571) > * Improve Python API handling of Params, persistence (SPARK-14771) > (SPARK-14706) > ** Note: The linked JIRAs for this are incomplete. More to be created... > ** Related: Implement Python meta-algorithms in Scala (to simplify > persistence) (SPARK-15574) > * Feature parity: The main goal of the Python API is to have feature parity > wit
[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740184#comment-15740184 ] Felix Cheung commented on SPARK-18813: -- I added a couple of JIRAs for R that can be found with [this query|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0%20ORDER%20BY%20priority%20DESC] > MLlib 2.2 Roadmap > - > > Key: SPARK-18813 > URL: https://issues.apache.org/jira/browse/SPARK-18813 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > > *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.* > The roadmap process described below is significantly updated since the 2.1 > roadmap [SPARK-15581]. Please refer to [SPARK-15581] for more discussion on > the basis for this proposal, and comment in this JIRA if you have suggestions > for improvements. > h1. Roadmap process > This roadmap is a master list for MLlib improvements we are working on during > this release. This includes ML-related changes in PySpark and SparkR. > *What is planned for the next release?* > * This roadmap lists issues which at least one Committer has prioritized. > See details below in "Instructions for committers." > * This roadmap only lists larger or more critical issues. > *How can contributors influence this roadmap?* > * If you believe an issue should be in this roadmap, please discuss the issue > on JIRA and/or the dev mailing list. Make sure to ping Committers since at > least one must agree to shepherd the issue. > * For general discussions, use this JIRA or the dev mailing list. For > specific issues, please comment on those issues or the mailing list. > h2. 
Target Version and Priority > This section describes the meaning of Target Version and Priority. _These > meanings have been updated in this proposal for the 2.2 process._ > || Category | Target Version | Priority | Shepherd | Put on roadmap? | In > next release? || > | 1 | next release | Blocker | *must* | *must* | *must* | > | 2 | next release | Critical | *must* | yes, unless small | *best effort* | > | 3 | next release | Major | *must* | optional | *best effort* | > | 4 | next release | Minor | optional | no | maybe | > | 5 | next release | Trivial | optional | no | maybe | > | 6 | (empty) | (any) | yes | no | maybe | > | 7 | (empty) | (any) | no | no | maybe | > The *Category* in the table above has the following meaning: > 1. A committer has promised to see this issue to completion for the next > release. Contributions *will* receive attention. > 2-3. A committer has promised to see this issue to completion for the next > release. Contributions *will* receive attention. The issue may slip to the > next release if development is slower than expected. > 4-5. A committer has promised interest in this issue. Contributions *will* > receive attention. The issue may slip to another release. > 6. A committer has promised interest in this issue and should respond, but no > promises are made about priorities or releases. > 7. This issue is open for discussion, but it needs a committer to promise > interest to proceed. > h1. Instructions > h2. For contributors > Getting started > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time contributor, please always start with a small > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a larger feature. > Coordinating on JIRA > * Never work silently. Let everyone know on the corresponding JIRA page when > you start work. This is to avoid duplicate work. 
For small patches, you do > not need to get the JIRA assigned to you to begin work. > * For medium/large features or features with dependencies, please get > assigned first before coding and keep the ETA updated on the JIRA. If there > is no activity on the JIRA page for a certain amount of time, the JIRA should > be released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Do not set these fields: Target Version, Fix Version, or Shepherd. Only > Committers should set those. > Writing and reviewing PRs > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * *Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours.* > h2. For Committers > Adding to this roadmap > * You can update the
[jira] [Commented] (SPARK-18822) Support ML Pipeline in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740181#comment-15740181 ] Felix Cheung commented on SPARK-18822: -- I'll take a shot at this. > Support ML Pipeline in SparkR > - > > Key: SPARK-18822 > URL: https://issues.apache.org/jira/browse/SPARK-18822 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Felix Cheung > > From Joseph Bradley: > " > Supporting Pipelines and advanced use cases: There really needs to be more > design discussion around SparkR. Felix Cheung would you be interested in > leading some discussion? I'm envisioning something similar to what was done a > while back for Pipelines in Scala/Java/Python, where we consider several use > cases of MLlib: fitting a single model, creating and tuning a complex > Pipeline, and working with multiple languages. That should help inform what > APIs should look like in Spark R. > "
[jira] [Created] (SPARK-18822) Support ML Pipeline in SparkR
Felix Cheung created SPARK-18822: Summary: Support ML Pipeline in SparkR Key: SPARK-18822 URL: https://issues.apache.org/jira/browse/SPARK-18822 Project: Spark Issue Type: New Feature Components: ML, SparkR Reporter: Felix Cheung From Joseph Bradley: " Supporting Pipelines and advanced use cases: There really needs to be more design discussion around SparkR. Felix Cheung would you be interested in leading some discussion? I'm envisioning something similar to what was done a while back for Pipelines in Scala/Java/Python, where we consider several use cases of MLlib: fitting a single model, creating and tuning a complex Pipeline, and working with multiple languages. That should help inform what APIs should look like in Spark R. "
[jira] [Created] (SPARK-18821) Bisecting k-means wrapper in SparkR
Felix Cheung created SPARK-18821: Summary: Bisecting k-means wrapper in SparkR Key: SPARK-18821 URL: https://issues.apache.org/jira/browse/SPARK-18821 Project: Spark Issue Type: New Feature Components: ML, SparkR Reporter: Felix Cheung Implement a wrapper in SparkR to support bisecting k-means
[jira] [Commented] (SPARK-18332) SparkR 2.1 QA: Programming guide, migration guide, vignettes updates
[ https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740172#comment-15740172 ] Felix Cheung commented on SPARK-18332: -- [~josephkb] they are because of the {code}@aliases{code} tags. I think we are adding them because CRAN checks require them to match the specific format - [~shivaram] would you know? I am pretty sure they are doubly listed because, in addition to aliases, we also have {code}@rdname{code}, which automatically generates the links as well. I suspect that if we change all the rdname tags to match the string in aliases, there will be one link. I can take a shot at testing this out, but the changes will be very extensive - is this something we could still get into 2.1? > SparkR 2.1 QA: Programming guide, migration guide, vignettes updates > > > Key: SPARK-18332 > URL: https://issues.apache.org/jira/browse/SPARK-18332 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Before the release, we need to update the SparkR Programming Guide, its > migration guide, and the R vignettes. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > * Update R vignettes > Note: This task is for large changes to the guides. New features are handled > in [SPARK-18330].
[jira] [Updated] (SPARK-18226) SparkR displaying vector columns in incorrect way
[ https://issues.apache.org/jira/browse/SPARK-18226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krishna Kalyan updated SPARK-18226: --- Component/s: SparkR > SparkR displaying vector columns in incorrect way > - > > Key: SPARK-18226 > URL: https://issues.apache.org/jira/browse/SPARK-18226 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Grzegorz Chilkiewicz >Priority: Trivial > > I have encountered a problem with SparkR presenting Spark vectors from the > org.apache.spark.mllib.linalg package: > * `head(df)` shows in the vector column: "" > * cast to string does not work as expected, it shows: > "[1,null,null,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@79f50a91]" > * `showDF(df)` works correctly > To reproduce, start SparkR and paste the following code (example taken from > https://spark.apache.org/docs/latest/sparkr.html#naive-bayes-model)
> {code}
> # Fit a Bernoulli naive Bayes model with spark.naiveBayes
> titanic <- as.data.frame(Titanic)
> titanicDF <- createDataFrame(titanic[titanic$Freq > 0, -5])
> nbDF <- titanicDF
> nbTestDF <- titanicDF
> nbModel <- spark.naiveBayes(nbDF, Survived ~ Class + Sex + Age)
> # Model summary
> summary(nbModel)
> # Prediction
> nbPredictions <- predict(nbModel, nbTestDF)
> #
> # My modification to expose the problem #
> nbPredictions$rawPrediction_str <- cast(nbPredictions$rawPrediction, "string")
> head(nbPredictions)
> showDF(nbPredictions)
> {code}
[jira] [Updated] (SPARK-18226) SparkR displaying vector columns in incorrect way
[ https://issues.apache.org/jira/browse/SPARK-18226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krishna Kalyan updated SPARK-18226: --- Component/s: (was: SparkR) > SparkR displaying vector columns in incorrect way > - > > Key: SPARK-18226 > URL: https://issues.apache.org/jira/browse/SPARK-18226 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Grzegorz Chilkiewicz >Priority: Trivial > > I have encountered a problem with SparkR presenting Spark vectors from the > org.apache.spark.mllib.linalg package: > * `head(df)` shows in the vector column: "" > * cast to string does not work as expected, it shows: > "[1,null,null,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@79f50a91]" > * `showDF(df)` works correctly > To reproduce, start SparkR and paste the following code (example taken from > https://spark.apache.org/docs/latest/sparkr.html#naive-bayes-model)
> {code}
> # Fit a Bernoulli naive Bayes model with spark.naiveBayes
> titanic <- as.data.frame(Titanic)
> titanicDF <- createDataFrame(titanic[titanic$Freq > 0, -5])
> nbDF <- titanicDF
> nbTestDF <- titanicDF
> nbModel <- spark.naiveBayes(nbDF, Survived ~ Class + Sex + Age)
> # Model summary
> summary(nbModel)
> # Prediction
> nbPredictions <- predict(nbModel, nbTestDF)
> #
> # My modification to expose the problem #
> nbPredictions$rawPrediction_str <- cast(nbPredictions$rawPrediction, "string")
> head(nbPredictions)
> showDF(nbPredictions)
> {code}
[jira] [Commented] (SPARK-18820) Driver may send "LaunchTask" before executor receive "RegisteredExecutor"
[ https://issues.apache.org/jira/browse/SPARK-18820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739970#comment-15739970 ] Shuai Lin commented on SPARK-18820: --- The driver first sends the {{RegisteredExecutor}} message and then, if there is a task scheduled to run on this executor, sends the {{LaunchTask}} message, both through the same underlying netty channel. So I think the order is guaranteed, and the problem described would never happen. > Driver may send "LaunchTask" before executor receive "RegisteredExecutor" > - > > Key: SPARK-18820 > URL: https://issues.apache.org/jira/browse/SPARK-18820 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.3 > Environment: spark-1.6.3 >Reporter: jin xing > > CoarseGrainedSchedulerBackend will update executorDataMap after receiving > "RegisterExecutor", thus task scheduler may assign tasks on to this executor; > If LaunchTask arrives at CoarseGrainedExecutorBackend before > RegisteredExecutor, it will result in NullPointerException and executor > backend will exit; > Is it a bug? If so can I make a pr? I think driver should send "LaunchTask" > after "RegisteredExecutor" is already received.
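The ordering argument in the comment above can be sketched with a toy, non-Spark model: two messages written to a single FIFO channel are always read back in send order. The {{queue.Queue}} here merely stands in for the single netty channel; this is an illustration of the claim, not Spark's actual RPC code.

```python
import queue

# Illustrative sketch (not Spark's RPC implementation): messages delivered
# over one ordered channel are consumed in the order they were sent, so an
# executor cannot observe LaunchTask before RegisteredExecutor.
channel = queue.Queue()  # stands in for the single underlying netty channel
channel.put("RegisteredExecutor")
channel.put("LaunchTask")
received = [channel.get(), channel.get()]
assert received == ["RegisteredExecutor", "LaunchTask"]
```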
[jira] [Updated] (SPARK-18820) Driver may send "LaunchTask" before executor receive "RegisteredExecutor"
[ https://issues.apache.org/jira/browse/SPARK-18820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jin xing updated SPARK-18820: - Description: CoarseGrainedSchedulerBackend will update executorDataMap after receiving "RegisterExecutor", thus task scheduler may assign tasks on to this executor; If LaunchTask arrives at CoarseGrainedExecutorBackend before RegisteredExecutor, it will result in NullPointerException and executor backend will exit; Is it a bug? If so can I make a pr? I think driver should send "LaunchTask" after "RegisteredExecutor" is already received. was: CoarseGrainedSchedulerBackend will update executorDataMap after receiving "RegisterExecutor", thus task scheduler may assign tasks on to this executor; If LaunchTask arrives at CoarseGrainedExecutorBackend before RegisteredExecutor, it will result in NullPointerException and executor backend will exit; Is it a bug? I think driver should send "LaunchTask" after "RegisteredExecutor" is already received. > Driver may send "LaunchTask" before executor receive "RegisteredExecutor" > - > > Key: SPARK-18820 > URL: https://issues.apache.org/jira/browse/SPARK-18820 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.3 > Environment: spark-1.6.3 >Reporter: jin xing > > CoarseGrainedSchedulerBackend will update executorDataMap after receiving > "RegisterExecutor", thus task scheduler may assign tasks on to this executor; > If LaunchTask arrives at CoarseGrainedExecutorBackend before > RegisteredExecutor, it will result in NullPointerException and executor > backend will exit; > Is it a bug? If so can I make a pr? I think driver should send "LaunchTask" > after "RegisteredExecutor" is already received.
[jira] [Created] (SPARK-18820) Driver may send "LaunchTask" before executor receives "RegisteredExecutor"
jin xing created SPARK-18820: Summary: Driver may send "LaunchTask" before executor receives "RegisteredExecutor" Key: SPARK-18820 URL: https://issues.apache.org/jira/browse/SPARK-18820 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.6.3 Environment: spark-1.6.3 Reporter: jin xing CoarseGrainedSchedulerBackend updates executorDataMap after receiving "RegisterExecutor", so the task scheduler may assign tasks to this executor. If "LaunchTask" arrives at CoarseGrainedExecutorBackend before "RegisteredExecutor", it results in a NullPointerException and the executor backend exits. Is this a bug? I think the driver should send "LaunchTask" only after "RegisteredExecutor" has been received. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
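The race reported above can be sketched without any Spark code. The class and message names below are hypothetical stand-ins for the CoarseGrainedExecutorBackend message protocol; this is an illustrative sketch of one possible fix (buffering early "LaunchTask" messages until registration completes), not the actual Spark implementation or the fix that was merged.

```python
from collections import deque

class ExecutorBackend:
    """Toy model of an executor backend's message loop (not Spark source)."""

    def __init__(self):
        self.registered = False
        self.pending = deque()   # tasks that arrived before registration
        self.launched = []

    def receive(self, message, payload=None):
        if message == "RegisteredExecutor":
            self.registered = True
            # Drain any tasks that raced ahead of the registration ack.
            while self.pending:
                self.launched.append(self.pending.popleft())
        elif message == "LaunchTask":
            if not self.registered:
                # Without a guard like this, the reported backend dereferences
                # a still-uninitialized executor and dies with a
                # NullPointerException.
                self.pending.append(payload)
            else:
                self.launched.append(payload)

backend = ExecutorBackend()
backend.receive("LaunchTask", "task-0")   # arrives first: buffered, no crash
backend.receive("RegisteredExecutor")     # registration drains the buffer
backend.receive("LaunchTask", "task-1")
print(backend.launched)                   # ['task-0', 'task-1']
```

The reporter's proposal (driver waits for "RegisteredExecutor" before sending "LaunchTask") attacks the same race from the sender's side; either ordering guarantee removes the NullPointerException window.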
[jira] [Comment Edited] (SPARK-18642) Spark SQL: Catalyst is scanning undesired columns
[ https://issues.apache.org/jira/browse/SPARK-18642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739526#comment-15739526 ] Mohit edited comment on SPARK-18642 at 12/11/16 10:42 AM: -- [~dongjoon] We would appreciate it if you could share your findings in the form of 'touch-points' in the source code. was (Author: mohitgargk): [~dongjoon] Please share your findings in the form of 'touch-points' in the source code. > Spark SQL: Catalyst is scanning undesired columns > - > > Key: SPARK-18642 > URL: https://issues.apache.org/jira/browse/SPARK-18642 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 1.6.3 > Environment: Ubuntu 14.04 > Spark: Local Mode >Reporter: Mohit > Labels: performance > Fix For: 2.0.0 > > > When doing a left join between two tables, say A and B, Catalyst has > information about the projection required for table B. Only the required > columns should be scanned. > The code snippet below illustrates the scenario: > scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA") > dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string] > scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB") > dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string] > scala> dfA.registerTempTable("A") > scala> dfB.registerTempTable("B") > scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid > where B.bid<2").explain > == Physical Plan == > Project [aid#15,bid#17] > +- Filter (bid#17 < 2) >+- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None > :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: > file:/home/mohit/ruleA > +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: > file:/home/mohit/ruleB > This is a watered-down example of a production issue with a huge > performance impact. 
> External reference: > http://stackoverflow.com/questions/40783675/spark-sql-catalyst-is-scanning-undesired-columns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18642) Spark SQL: Catalyst is scanning undesired columns
[ https://issues.apache.org/jira/browse/SPARK-18642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739526#comment-15739526 ] Mohit commented on SPARK-18642: --- [~dongjoon] Please share your findings in form of 'touch-points' from the source-code. > Spark SQL: Catalyst is scanning undesired columns > - > > Key: SPARK-18642 > URL: https://issues.apache.org/jira/browse/SPARK-18642 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 1.6.3 > Environment: Ubuntu 14.04 > Spark: Local Mode >Reporter: Mohit > Labels: performance > Fix For: 2.0.0 > > > When doing a left-join between two tables, say A and B, Catalyst has > information about the projection required for table B. Only the required > columns should be scanned. > Code snippet below explains the scenario: > scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA") > dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string] > scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB") > dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string] > scala> dfA.registerTempTable("A") > scala> dfB.registerTempTable("B") > scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid > where B.bid<2").explain > == Physical Plan == > Project [aid#15,bid#17] > +- Filter (bid#17 < 2) >+- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None > :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: > file:/home/mohit/ruleA > +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: > file:/home/mohit/ruleB > This is a watered-down example from a production issue which has a huge > performance impact. > External reference: > http://stackoverflow.com/questions/40783675/spark-sql-catalyst-is-scanning-undesired-columns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
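The column-pruning complaint above can be made concrete with a pure-Python sketch (this is not Catalyst code; the `scan` helper and the tiny in-memory table are invented for illustration). The scan records which columns it actually reads, mimicking what a Parquet scan with projection pushdown would do versus the `ParquetRelation[bid#17,bVal#18]` scan shown in the reported plan.

```python
def scan(rows, columns):
    """Read only the requested columns; remember which ones were touched."""
    scan.read_columns = set(columns)
    return [{c: r[c] for c in columns} for r in rows]

# Stand-in for table B from the report (columns bid, bVal).
table_b = [{"bid": 1, "bVal": "x"}, {"bid": 2, "bVal": "y"}]

# The query "select A.aid, B.bid ... left join ... on A.aid = B.bid" only
# needs B.bid, so an ideal plan scans just that column:
pruned = scan(table_b, ["bid"])
assert scan.read_columns == {"bid"}           # bVal was never read

# The plan in the report instead scans ParquetRelation[bid#17,bVal#18]:
unpruned = scan(table_b, ["bid", "bVal"])
assert scan.read_columns == {"bid", "bVal"}   # extra I/O for bVal
```

In user code, a common workaround while an optimizer misses pruning is to project explicitly before joining (e.g. `dfB.select("bid")`), which forces the narrower scan regardless of what the planner infers.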
[jira] [Resolved] (SPARK-18196) Optimise CompactBuffer implementation
[ https://issues.apache.org/jira/browse/SPARK-18196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18196. --- Resolution: Won't Fix For now this looks like a "wontfix", as it doesn't result in a speedup. > Optimise CompactBuffer implementation > - > > Key: SPARK-18196 > URL: https://issues.apache.org/jira/browse/SPARK-18196 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.2, 2.0.1 >Reporter: Adam Roberts >Priority: Minor > > This change slightly increases the class footprint (8 bytes on IBM Java, 12 > bytes on OpenJDK and Oracle's) but we've observed a 4% performance improvement > on PageRank using HiBench large with this change, so a worthy trade-off IMO. > This yields a shorter path length for the JIT because there are fewer if-else > statements. > Config used on HiBench: > spark.executor.memory 25G > spark.driver.memory 4G > spark.serializer org.apache.spark.serializer.KryoSerializer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
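For context on what is being optimised: Spark's CompactBuffer (Scala) keeps the first two elements in dedicated fields so that the common "very few values per key" case allocates no backing array. The Python rendition below is an illustrative sketch of that layout idea, not the Spark class or the proposed patch.

```python
class CompactBuffer:
    """Append-only buffer; first two elements live inline, rest spill lazily."""

    def __init__(self):
        self._e0 = None
        self._e1 = None
        self._size = 0
        self._rest = None            # allocated only on the 3rd element

    def append(self, value):
        if self._size == 0:
            self._e0 = value
        elif self._size == 1:
            self._e1 = value
        else:
            if self._rest is None:
                self._rest = []      # lazy spill: avoids allocation for <=2 items
            self._rest.append(value)
        self._size += 1
        return self

    def __len__(self):
        return self._size

    def __getitem__(self, i):
        if i == 0 and self._size > 0:
            return self._e0
        if i == 1 and self._size > 1:
            return self._e1
        if 2 <= i < self._size:
            return self._rest[i - 2]
        raise IndexError(i)

buf = CompactBuffer()
for v in "abc":
    buf.append(v)
print(list(buf))                     # ['a', 'b', 'c']
```

The JIRA's trade-off is visible even here: the branching in `append`/`__getitem__` is the "path length" cost, while a two-element buffer never allocates `_rest`.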
[jira] [Issue Comment Deleted] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Kamprath updated SPARK-18819: - Comment: was deleted (was: Possibly. I can dump the file created using [parquet-tools|https://github.com/Parquet/parquet-mr/tree/master/parquet-tools] on the ARM machines using the same java installation. I am assuming that this at least rules out the JVM, but not necessarily the parquet lib because I am using the latest snapshot of parquet to do the dump (which might not be the same as in spark 2.0.2). The fact that this problem arises with both HDFS and QFS as the file system rules out the file system itself, though not necessarily the spark interface to it. If this is not enough, I'll see what I can do to isolate it more.) > Failure to read single-row Parquet files > > > Key: SPARK-18819 > URL: https://issues.apache.org/jira/browse/SPARK-18819 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.0.2 > Environment: Ubuntu 14.04 LTS on ARM 7.1 >Reporter: Michael Kamprath >Priority: Critical > > When I create a data frame in PySpark with a small row count (less than > number executors), then write it to a parquet file, then load that parquet > file into a new data frame, and finally do any sort of read against the > loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. 
> Example code to replicate this issue: > {code} > from pyspark.sql.types import * > rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')]) > my_schema = StructType([ > StructField("id", StringType(), True), > StructField("value1", IntegerType(), True), > StructField("value2", DoubleType(), True), > StructField("name",StringType(), True) > ]) > df = spark.createDataFrame( rdd, schema=my_schema) > df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite') > newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > newdf.take(1) > {code} > The error I get when the {{take}} step runs is: > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > 1 newdf = > spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > > 2 newdf.take(1) > /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num) > 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')] > 347 """ > --> 348 return self.limit(num).collect() > 349 > 350 @since(1.3) > /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self) > 308 """ > 309 with SCCallSiteSync(self._sc) as css: > --> 310 port = self._jdf.collectToPython() > 311 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 312 > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". 
> --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o54.collectToPython. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of > the running tasks) Reason: Remote RPC client disassociated. Likely due to > containers exceeding thresholds, or network issues. Check driver logs for > WARN messages. > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) >
[jira] [Commented] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739488#comment-15739488 ] Michael Kamprath commented on SPARK-18819: -- Possibly. I can dump the file created using [parquet-tools|https://github.com/Parquet/parquet-mr/tree/master/parquet-tools] on the ARM machines using the same java installation. I am assuming that this at least rules out the JVM, but not necessarily the parquet lib because I am using the latest snapshot of parquet to do the dump (which might not be the same as in spark 2.0.2). The fact that this problem arises with both HDFS and QFS as the file system rules out the file system itself, though not necessarily the spark interface to it. If this is not enough, I'll see what I can do to isolate it more. > Failure to read single-row Parquet files > > > Key: SPARK-18819 > URL: https://issues.apache.org/jira/browse/SPARK-18819 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.0.2 > Environment: Ubuntu 14.04 LTS on ARM 7.1 >Reporter: Michael Kamprath >Priority: Critical > > When I create a data frame in PySpark with a small row count (less than > number executors), then write it to a parquet file, then load that parquet > file into a new data frame, and finally do any sort of read against the > loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. 
> Example code to replicate this issue: > {code} > from pyspark.sql.types import * > rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')]) > my_schema = StructType([ > StructField("id", StringType(), True), > StructField("value1", IntegerType(), True), > StructField("value2", DoubleType(), True), > StructField("name",StringType(), True) > ]) > df = spark.createDataFrame( rdd, schema=my_schema) > df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite') > newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > newdf.take(1) > {code} > The error I get when the {{take}} step runs is: > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > 1 newdf = > spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > > 2 newdf.take(1) > /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num) > 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')] > 347 """ > --> 348 return self.limit(num).collect() > 349 > 350 @since(1.3) > /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self) > 308 """ > 309 with SCCallSiteSync(self._sc) as css: > --> 310 port = self._jdf.collectToPython() > 311 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 312 > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". 
> --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o54.collectToPython. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of > the running tasks) Reason: Remote RPC client disassociated. Likely due to > containers exceeding thresholds, or network issues. Check driver logs for > WARN messages. > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAG
[jira] [Commented] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739487#comment-15739487 ] Michael Kamprath commented on SPARK-18819: -- Possibly. I can dump the file created using [parquet-tools|https://github.com/Parquet/parquet-mr/tree/master/parquet-tools] on the ARM machines using the same java installation. I am assuming that this at least rules out the JVM, but not necessarily the parquet lib because I am using the latest snapshot of parquet to do the dump (which might not be the same as in spark 2.0.2). The fact that this problem arises with both HDFS and QFS as the file system rules out the file system itself, though not necessarily the spark interface to it. If this is not enough, I'll see what I can do to isolate it more. > Failure to read single-row Parquet files > > > Key: SPARK-18819 > URL: https://issues.apache.org/jira/browse/SPARK-18819 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.0.2 > Environment: Ubuntu 14.04 LTS on ARM 7.1 >Reporter: Michael Kamprath >Priority: Critical > > When I create a data frame in PySpark with a small row count (less than > number executors), then write it to a parquet file, then load that parquet > file into a new data frame, and finally do any sort of read against the > loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. 
> Example code to replicate this issue: > {code} > from pyspark.sql.types import * > rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')]) > my_schema = StructType([ > StructField("id", StringType(), True), > StructField("value1", IntegerType(), True), > StructField("value2", DoubleType(), True), > StructField("name",StringType(), True) > ]) > df = spark.createDataFrame( rdd, schema=my_schema) > df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite') > newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > newdf.take(1) > {code} > The error I get when the {{take}} step runs is: > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > 1 newdf = > spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > > 2 newdf.take(1) > /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num) > 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')] > 347 """ > --> 348 return self.limit(num).collect() > 349 > 350 @since(1.3) > /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self) > 308 """ > 309 with SCCallSiteSync(self._sc) as css: > --> 310 port = self._jdf.collectToPython() > 311 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 312 > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". 
> --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o54.collectToPython. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of > the running tasks) Reason: Remote RPC client disassociated. Likely due to > containers exceeding thresholds, or network issues. Check driver logs for > WARN messages. > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAG
[jira] [Commented] (SPARK-18750) Spark should be able to control the number of executors and should not throw stack overflow
[ https://issues.apache.org/jira/browse/SPARK-18750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739460#comment-15739460 ] Sean Owen commented on SPARK-18750: --- I'm going to close this as a duplicate of SPARK-18769 unless there's evidence that this is an error from Spark, and can be patched separately from the apparent underlying cause, which is in that JIRA. > spark should be able to control the number of executor and should not throw > stack overslow > -- > > Key: SPARK-18750 > URL: https://issues.apache.org/jira/browse/SPARK-18750 > Project: Spark > Issue Type: Bug >Reporter: Neerja Khattar > > When running Sql queries on large datasets. Job fails with stack overflow > warning and it shows it is requesting lots of executors. > Looks like there is no limit to number of executors or not even having an > upperbound based on yarn available resources. > 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : > bdtcstr61n5.svr.us.jpmchase.net:8041 > 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : > bdtcstr61n8.svr.us.jpmchase.net:8041 > 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : > bdtcstr61n2.svr.us.jpmchase.net:8041 > 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of > 32770 executor(s). > 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor > containers, each with 1 cores and 6758 MB memory including 614 MB overhead > 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of > 52902 executor(s). 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : > bdtcstr61n5.svr.us.jpmchase.net:8041 > 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : > bdtcstr61n8.svr.us.jpmchase.net:8041 > 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : > bdtcstr61n2.svr.us.jpmchase.net:8041 > 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of > 32770 executor(s). > 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor > containers, each with 1 cores and 6758 MB memory including 614 MB overhead > 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of > 52902 executor(s). > 16/11/29 15:49:11 WARN yarn.ApplicationMaster: Reporter thread fails 1 > time(s) in a row. > java.lang.StackOverflowError > at scala.collection.immutable.HashMap.$plus(HashMap.scala:57) > at scala.collection.immutable.HashMap.$plus(HashMap.scala:36) > at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28) > at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24) > at > scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48) > at > scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48) > at > scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224) > at > scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) > at > scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:24) > at > scala.collection.TraversableLike$class.$plus$plus(TraversableLike.scala:156) > at > scala.collection.AbstractTraversable.$plus$plus(Traversable.scala:105) > at scala.collection.immutable.HashMap.$plus(HashMap.scala:60) > at scala.collection.immutable.Map$Map4.updated(Map.scala:172) > at 
scala.collection.immutable.Map$Map4.$plus(Map.scala:173) > at scala.collection.immutable.Map$Map4.$plus(Map.scala:158) > at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28) > at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24) > at > scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264) > at > scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245) > at > scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245) > at > scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245) >
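The log excerpt above shows the driver's requested executor total climbing without bound (32770, then 52902). A minimal sketch of the missing safeguard, assuming a cap like the one `spark.dynamicAllocation.maxExecutors` provides in real deployments: the helper name and the numbers are hypothetical, and this is not the YarnAllocator code.

```python
def bounded_target(requested, running, max_executors):
    """Clamp a runaway executor target, then return how many more to ask for."""
    target = min(requested, max_executors)   # never exceed the configured cap
    return max(target - running, 0)          # never ask for a negative count

# With the driver asking for 52902 executors (as in the log) and 4096 already
# running, a cap of 8192 limits the new container request to 4096:
print(bounded_target(requested=52902, running=4096, max_executors=8192))  # 4096

# If the cluster is already at or above target, nothing new is requested:
print(bounded_target(requested=10, running=20, max_executors=100))        # 0
```

This only addresses the unbounded-request symptom; the StackOverflowError in the reporter thread is the separate underlying issue tracked in SPARK-18769.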
[jira] [Commented] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739458#comment-15739458 ] Sean Owen commented on SPARK-18819: --- Surely, this is specific to ARM if it doesn't occur on x86? I doubt it has anything to do with Parquet per se. I have no particular reason to believe ARM doesn't work, but also doubt it's been tested or is supported. This still just contains the driver stack trace, which says "something went wrong over there". It's not even clear the failure is from Spark. > Failure to read single-row Parquet files > > > Key: SPARK-18819 > URL: https://issues.apache.org/jira/browse/SPARK-18819 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.0.2 > Environment: Ubuntu 14.04 LTS on ARM 7.1 >Reporter: Michael Kamprath >Priority: Critical > > When I create a data frame in PySpark with a small row count (less than > number executors), then write it to a parquet file, then load that parquet > file into a new data frame, and finally do any sort of read against the > loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. 
> Example code to replicate this issue: > {code} > from pyspark.sql.types import * > rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')]) > my_schema = StructType([ > StructField("id", StringType(), True), > StructField("value1", IntegerType(), True), > StructField("value2", DoubleType(), True), > StructField("name",StringType(), True) > ]) > df = spark.createDataFrame( rdd, schema=my_schema) > df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite') > newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > newdf.take(1) > {code} > The error I get when the {{take}} step runs is: > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > 1 newdf = > spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > > 2 newdf.take(1) > /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num) > 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')] > 347 """ > --> 348 return self.limit(num).collect() > 349 > 350 @since(1.3) > /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self) > 308 """ > 309 with SCCallSiteSync(self._sc) as css: > --> 310 port = self._jdf.collectToPython() > 311 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 312 > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". 
> --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o54.collectToPython. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of > the running tasks) Reason: Remote RPC client disassociated. Likely due to > containers exceeding thresholds, or network issues. Check driver logs for > WARN messages. > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGSched
[jira] [Updated] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Kamprath updated SPARK-18819: - Description: When I create a data frame in PySpark with a small row count (less than number executors), then write it to a parquet file, then load that parquet file into a new data frame, and finally do any sort of read against the loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. Example code to replicate this issue: {code} from pyspark.sql.types import * rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')]) my_schema = StructType([ StructField("id", StringType(), True), StructField("value1", IntegerType(), True), StructField("value2", DoubleType(), True), StructField("name",StringType(), True) ]) df = spark.createDataFrame( rdd, schema=my_schema) df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite') newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') newdf.take(1) {code} The error I get when the {{take}} step runs is: {code} --- Py4JJavaError Traceback (most recent call last) in () 1 newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > 2 newdf.take(1) /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num) 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')] 347 """ --> 348 return self.limit(num).collect() 349 350 @since(1.3) /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self) 308 """ 309 with SCCallSiteSync(self._sc) as css: --> 310 port = self._jdf.collectToPython() 311 return list(_load_from_socket(port, BatchedSerializer(PickleSerializer( 312 /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in __call__(self, *args) 1131 answer = self.gateway_client.send_command(command) 1132 return_value = get_return_value( -> 1133 answer, self.gateway_client, self.target_id, self.name) 1134 1135 for temp_arg in temp_args: /usr/local/spark/python/pyspark/sql/utils.py 
in deco(*a, **kw) 61 def deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 317 raise Py4JJavaError( 318 "An error occurred while calling {0}{1}{2}.\n". --> 319 format(target_id, ".", name), value) 320 else: 321 raise Py4JError( Py4JJavaError: An error occurred while calling o54.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667) at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873) at org.apache.spark.SparkContext.run
[jira] [Comment Edited] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739452#comment-15739452 ] Michael Kamprath edited comment on SPARK-18819 at 12/11/16 9:42 AM: Sure. I updated the description above.
[jira] [Commented] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739452#comment-15739452 ] Michael Kamprath commented on SPARK-18819: -- Sure. The complete error message is: {{code}} --- Py4JJavaError Traceback (most recent call last) in () 1 newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > 2 newdf.take(1) /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num) 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')] 347 """ --> 348 return self.limit(num).collect() 349 350 @since(1.3) /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self) 308 """ 309 with SCCallSiteSync(self._sc) as css: --> 310 port = self._jdf.collectToPython() 311 return list(_load_from_socket(port, BatchedSerializer(PickleSerializer( 312 /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in __call__(self, *args) 1131 answer = self.gateway_client.send_command(command) 1132 return_value = get_return_value( -> 1133 answer, self.gateway_client, self.target_id, self.name) 1134 1135 for temp_arg in temp_args: /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 61 def deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 317 raise Py4JJavaError( 318 "An error occurred while calling {0}{1}{2}.\n". --> 319 format(target_id, ".", name), value) 320 else: 321 raise Py4JError( Py4JJavaError: An error occurred while calling o54.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. 
Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1886) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1899) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39) at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2526) at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2523) at 
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2523) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546) at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2523) at sun.reflect.N
[jira] [Resolved] (SPARK-18653) Dataset.show() generates incorrect padding for Unicode Character
[ https://issues.apache.org/jira/browse/SPARK-18653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18653. --- Resolution: Won't Fix > Dataset.show() generates incorrect padding for Unicode Character > > > Key: SPARK-18653 > URL: https://issues.apache.org/jira/browse/SPARK-18653 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > > The following program generates incorrect space padding for > {{Dataset.show()}} when a column name or column value contains Unicode characters. > Program > {code:java} > case class UnicodeCaseClass(整数: Int, 実数: Double, s: String) > val ds = Seq(UnicodeCaseClass(1, 1.1, "文字列1")).toDS > ds.show > {code} > Output > {code} > +---+---++ > | 整数| 実数| s| > +---+---++ > | 1|1.1|文字列1| > +---+---++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
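The misaligned table above happens because padding is computed from the code-point count, while East Asian Wide characters occupy two terminal columns each. A minimal pure-Python sketch of display-width-aware padding (stdlib only; this is not Spark's actual implementation, just an illustration of the fix):

```python
import unicodedata

def display_width(s: str) -> int:
    """Approximate terminal display width: East Asian Wide ('W') and
    Fullwidth ('F') characters occupy two columns, everything else one."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in s)

def pad(s: str, width: int) -> str:
    """Right-pad `s` with spaces up to the given display width."""
    return s + " " * max(0, width - display_width(s))

# len() sees 2 code points, but the string fills 4 terminal columns,
# which is why code-point-based padding comes up short:
assert len("整数") == 2
assert display_width("整数") == 4
```

Padding every cell with `pad` instead of `str.ljust` would keep the `+---+` borders aligned for the example in the report.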
[jira] [Updated] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Kamprath updated SPARK-18819: - Description: When I create a data frame in PySpark with a small row count (less than number executors), then write it to a parquet file, then load that parquet file into a new data frame, and finally do any sort of read against the loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. Example code to replicate this issue: {code} from pyspark.sql.types import * rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')]) my_schema = StructType([ StructField("id", StringType(), True), StructField("value1", IntegerType(), True), StructField("value2", DoubleType(), True), StructField("name",StringType(), True) ]) df = spark.createDataFrame( rdd, schema=my_schema) df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite') newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') newdf.take(1) {code} The error I get when the {{take}} step runs is: {code} Py4JJavaError: An error occurred while calling o54.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 8, 10.10.10.4): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1886) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1899) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39) at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2526) at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2523) at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2523) at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546) at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2523) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745) {code} I have tested this against HDFS 2.7 and QFS 1.2 on an ARM v7.1 based cluster. Both have the same results. Note I have verified this issue doesn't express on x86 platforms. The
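The trigger condition in the description above is a row count smaller than the number of executors, which means some partitions end up empty. A minimal pure-Python sketch (no Spark required; the slicing mirrors how `parallelize` splits a local list, not Spark's exact code) showing that the two-row dataset leaves half of four partitions empty:

```python
def split_into_partitions(rows, num_partitions):
    """Slice a list into roughly equal contiguous partitions."""
    n = len(rows)
    return [rows[n * i // num_partitions : n * (i + 1) // num_partitions]
            for i in range(num_partitions)]

rows = [('row1', 1, 4.33, 'name'), ('row2', 2, 3.14, 'string')]
parts = split_into_partitions(rows, 4)  # e.g. a 4-executor cluster
# Two of the four partitions carry a row; the other two are empty:
assert sum(1 for p in parts if not p) == 2
```

Tasks scheduled on the empty partitions would read Parquet files containing no row groups, which is the situation the reporter observed failing on ARM but not x86.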
[jira] [Updated] (SPARK-18628) Update handle invalid documentation string
[ https://issues.apache.org/jira/browse/SPARK-18628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18628: -- Assignee: Krishna Kalyan > Update handle invalid documentation string > -- > > Key: SPARK-18628 > URL: https://issues.apache.org/jira/browse/SPARK-18628 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Assignee: Krishna Kalyan >Priority: Trivial > Labels: starter > Fix For: 2.1.1, 2.2.0 > > > The handleInvalid parameter documentation string currently doesn't have > quotes around the options; after SPARK-18366 is in, it would be good to > update both the Scala param and Python param to have quotes around the > options, making them easier for users to read.
[jira] [Resolved] (SPARK-18628) Update handle invalid documentation string
[ https://issues.apache.org/jira/browse/SPARK-18628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18628. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16242 [https://github.com/apache/spark/pull/16242] > Update handle invalid documentation string > -- > > Key: SPARK-18628 > URL: https://issues.apache.org/jira/browse/SPARK-18628 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > Labels: starter > Fix For: 2.1.1, 2.2.0 > > > The handleInvalid parameter documentation string currently doesn't have > quotes around the options; after SPARK-18366 is in, it would be good to > update both the Scala param and Python param to have quotes around the > options, making them easier for users to read.
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739432#comment-15739432 ] Sean Owen commented on SPARK-9487: -- I think this is going around in circles. You already have an open invitation to improve tests in any logical subset of the project in order to accomplish this change in number of worker threads. You're saying you are unable to get them to pass on Jenkins and unwilling to debug. I don't think there is more guidance to give here; either you can effect this change or not. If nobody can or seems willing to try, I think it should be closed, because this really isn't an error to start with, nor even that suboptimal (excepting that it has revealed that a couple of tests could be a little more robust). > Use the same num. worker threads in Scala/Python unit tests > --- > > Key: SPARK-9487 > URL: https://issues.apache.org/jira/browse/SPARK-9487 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL, Tests >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > Labels: starter > Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults > > > In Python we use `local[4]` for unit tests, while in Scala/Java we use > `local[2]` and `local` for some unit tests in SQL, MLLib, and other > components. If the operation depends on partition IDs, e.g., random number > generator, this will lead to different results in Python and Scala/Java. It > would be nice to use the same number in all unit tests.
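The description's point about partition-ID-dependent operations can be made concrete with a minimal pure-Python sketch (no Spark; `sample_per_partition` is a hypothetical stand-in for the common pattern of seeding a per-partition random generator, as `mapPartitionsWithIndex`-based samplers do):

```python
import random

def sample_per_partition(data, num_partitions):
    """Draw one pseudo-random value per element, seeding the generator
    with the partition index -- the pattern that makes results depend
    on how the data is split across partitions."""
    n = len(data)
    out = []
    for pid in range(num_partitions):
        rng = random.Random(pid)  # seed tied to partition ID
        chunk = data[n * pid // num_partitions : n * (pid + 1) // num_partitions]
        out.extend(rng.random() for _ in chunk)
    return out

data = list(range(8))
# Identical input data, but 2 vs 4 partitions yield different "random" output,
# which is why local[2] and local[4] test runs can disagree:
assert sample_per_partition(data, 2) != sample_per_partition(data, 4)
```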
[jira] [Updated] (SPARK-18809) Kinesis deaggregation issue on master
[ https://issues.apache.org/jira/browse/SPARK-18809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18809: -- Assignee: Brian ONeill > Kinesis deaggregation issue on master > - > > Key: SPARK-18809 > URL: https://issues.apache.org/jira/browse/SPARK-18809 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.2 >Reporter: Brian ONeill >Assignee: Brian ONeill >Priority: Minor > Fix For: 2.2.0 > > > Fix for SPARK-14421 was never applied to master. > https://github.com/apache/spark/pull/16236 > Upgrade KCL to 1.6.2 to support deaggregation.
[jira] [Updated] (SPARK-18809) Kinesis deaggregation issue on master
[ https://issues.apache.org/jira/browse/SPARK-18809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18809: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) > Kinesis deaggregation issue on master > - > > Key: SPARK-18809 > URL: https://issues.apache.org/jira/browse/SPARK-18809 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.2 >Reporter: Brian ONeill >Priority: Minor > Fix For: 2.2.0 > > > Fix for SPARK-14421 was never applied to master. > https://github.com/apache/spark/pull/16236 > Upgrade KCL to 1.6.2 to support deaggregation.
[jira] [Resolved] (SPARK-18809) Kinesis deaggregation issue on master
[ https://issues.apache.org/jira/browse/SPARK-18809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18809. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16236 [https://github.com/apache/spark/pull/16236] > Kinesis deaggregation issue on master > - > > Key: SPARK-18809 > URL: https://issues.apache.org/jira/browse/SPARK-18809 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.2 >Reporter: Brian ONeill > Fix For: 2.2.0 > > > Fix for SPARK-14421 was never applied to master. > https://github.com/apache/spark/pull/16236 > Upgrade KCL to 1.6.2 to support deaggregation.
[jira] [Commented] (SPARK-18819) Failure to read single-row Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739422#comment-15739422 ] Sean Owen commented on SPARK-18819: --- This doesn't say anything about the underlying error though. Without that I think this would have to be closed as unactionable. Any more detail? > Failure to read single-row Parquet files > > > Key: SPARK-18819 > URL: https://issues.apache.org/jira/browse/SPARK-18819 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.0.2 > Environment: Ubuntu 14.04 LTS on ARM 7.1 >Reporter: Michael Kamprath >Priority: Critical > > When I create a data frame in PySpark with a small row count (less than > number executors), then write it to a parquet file, then load that parquet > file into a new data frame, and finally do any sort of read against the > loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. > Example code to replicate this issue: > {code} > from pyspark.sql.types import * > rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')]) > my_schema = StructType([ > StructField("id", StringType(), True), > StructField("value1", IntegerType(), True), > StructField("value2", DoubleType(), True), > StructField("name",StringType(), True) > ]) > df = spark.createDataFrame( rdd, schema=my_schema) > df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite') > newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > newdf.take(1) > {code} > The error I get when the {{take}} step runs is: > {code} > Py4JJavaError: An error occurred while calling o54.collectToPython. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 8, 10.10.10.4): ExecutorLostFailure (executor 0 exited caused by one of > the running tasks) Reason: Remote RPC client disassociated. 
Likely due to > containers exceeding thresholds, or network issues. Check driver logs for > WARN messages. > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1886) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1899) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2526) > at > 
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2523) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2523) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546) > at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2523) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl
[jira] [Resolved] (SPARK-18799) Spark SQL expose interface for pluggable parser extension
[ https://issues.apache.org/jira/browse/SPARK-18799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18799. --- Resolution: Duplicate > Spark SQL expose interface for pluggable parser extension > --- > > Key: SPARK-18799 > URL: https://issues.apache.org/jira/browse/SPARK-18799 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jihong MA > > There used to be an interface to plug in a parser extension through > ParserDialect in HiveContext in all Spark 1.x versions. Starting with the Spark 2.x > releases, Apache Spark moved to the new parser (Antlr4); there is no longer a > way to extend the default SQL parser through the SparkSession interface. However, > this is a real pain and hard to work around when integrating other data > sources with Spark with extended support such as Insert, Update, or Delete > statements, or any other data management statement. > It would be very nice to continue to expose an interface for parser extension > to make data source integration easier and smoother.
[jira] [Commented] (SPARK-18786) pySpark SQLContext.getOrCreate(sc) take stopped sparkContext
[ https://issues.apache.org/jira/browse/SPARK-18786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15739415#comment-15739415 ] Sean Owen commented on SPARK-18786: --- I agree it's surprising and maybe fixable, but this may be in the category of things you just shouldn't do. You generally do not stop() a SparkContext except at the end of a program. > pySpark SQLContext.getOrCreate(sc) take stopped sparkContext > > > Key: SPARK-18786 > URL: https://issues.apache.org/jira/browse/SPARK-18786 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Alex Liu > > The following steps to reproduce the issue > {code} > import sys > sys.path.insert(1, 'spark/python/') > sys.path.insert(1, 'spark/python/lib/py4j-0.9-src.zip') > from pyspark import SparkContext, SQLContext > sc = SparkContext.getOrCreate() > sqlContext = SQLContext.getOrCreate(sc) > sqlContext.read.json(sc.parallelize(['{{ "name": "Adam" }}'])).show() > sc.stop() > sc = SparkContext.getOrCreate() > sqlContext = SQLContext.getOrCreate(sc) > sqlContext.read.json(sc.parallelize(['{{ "name": "Adam" }}'])).show() > {code} > It has the following errors after the last command > {code} > >>> sqlContext.read.json(sc.parallelize(['{{ "name": "Adam" }}'])).show() > Traceback (most recent call last): > > File "", line 1, in > File "spark/python/pyspark/sql/dataframe.py", line 257, in show > print(self._jdf.showString(n, truncate)) > File "spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in > __call__ > File "spark/python/pyspark/sql/utils.py", line 45, in deco > return f(*a, **kw) > File "spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in > get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o435.showString. > : java.lang.IllegalStateException: Cannot call methods on a stopped > SparkContext. 
> This stopped SparkContext was created at: > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59) > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > java.lang.reflect.Constructor.newInstance(Constructor.java:422) > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) > py4j.Gateway.invoke(Gateway.java:214) > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) > py4j.GatewayConnection.run(GatewayConnection.java:209) > java.lang.Thread.run(Thread.java:745) > The currently active SparkContext was created at: > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59) > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > java.lang.reflect.Constructor.newInstance(Constructor.java:422) > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) > py4j.Gateway.invoke(Gateway.java:214) > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) > py4j.GatewayConnection.run(GatewayConnection.java:209) > java.lang.Thread.run(Thread.java:745) > > at > org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:106) > at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1325) > at > 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:126) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:349) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$an
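The surprising behavior reported above comes from a getOrCreate-style cache that hands back an instance bound to the old, stopped context. A minimal pure-Python sketch of that caching pattern (hypothetical `FakeContext`/`FakeSQLContext` names for illustration only, not the actual PySpark source):

```python
class FakeContext:
    """Stand-in for a SparkContext with a stop() method."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


class FakeSQLContext:
    """Stand-in for a SQLContext with a class-level instance cache."""
    _instance = None

    def __init__(self, ctx):
        self.ctx = ctx

    @classmethod
    def get_or_create(cls, ctx):
        # Bug pattern: the cache is consulted without checking whether the
        # context wrapped by the cached instance has been stopped.
        if cls._instance is None:
            cls._instance = cls(ctx)
        return cls._instance


sc1 = FakeContext()
sql1 = FakeSQLContext.get_or_create(sc1)
sc1.stop()

sc2 = FakeContext()                       # a fresh, running context...
sql2 = FakeSQLContext.get_or_create(sc2)
assert sql2 is sql1                       # ...but the stale wrapper is returned,
assert sql2.ctx.stopped                   # still pointing at the stopped context
```

Invalidating the cache when the wrapped context is stopped (or keying the cache on the context) would avoid the "Cannot call methods on a stopped SparkContext" error in the report.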
[jira] [Updated] (SPARK-18786) pySpark SQLContext.getOrCreate(sc) take stopped sparkContext
[ https://issues.apache.org/jira/browse/SPARK-18786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-18786: --- Component/s: PySpark > pySpark SQLContext.getOrCreate(sc) take stopped sparkContext > > > Key: SPARK-18786 > URL: https://issues.apache.org/jira/browse/SPARK-18786 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Alex Liu > > The following steps reproduce the issue > {code} > import sys > sys.path.insert(1, 'spark/python/') > sys.path.insert(1, 'spark/python/lib/py4j-0.9-src.zip') > from pyspark import SparkContext, SQLContext > sc = SparkContext.getOrCreate() > sqlContext = SQLContext.getOrCreate(sc) > sqlContext.read.json(sc.parallelize(['{"name": "Adam"}'])).show() > sc.stop() > sc = SparkContext.getOrCreate() > sqlContext = SQLContext.getOrCreate(sc) > sqlContext.read.json(sc.parallelize(['{"name": "Adam"}'])).show() > {code} > The last command then fails with the following error > {code} > >>> sqlContext.read.json(sc.parallelize(['{"name": "Adam"}'])).show() > Traceback (most recent call last): > > File "", line 1, in > File "spark/python/pyspark/sql/dataframe.py", line 257, in show > print(self._jdf.showString(n, truncate)) > File "spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in > __call__ > File "spark/python/pyspark/sql/utils.py", line 45, in deco > return f(*a, **kw) > File "spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in > get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o435.showString. > : java.lang.IllegalStateException: Cannot call methods on a stopped > SparkContext. 
> This stopped SparkContext was created at: > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59) > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > java.lang.reflect.Constructor.newInstance(Constructor.java:422) > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) > py4j.Gateway.invoke(Gateway.java:214) > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) > py4j.GatewayConnection.run(GatewayConnection.java:209) > java.lang.Thread.run(Thread.java:745) > The currently active SparkContext was created at: > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59) > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > java.lang.reflect.Constructor.newInstance(Constructor.java:422) > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) > py4j.Gateway.invoke(Gateway.java:214) > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) > py4j.GatewayConnection.run(GatewayConnection.java:209) > java.lang.Thread.run(Thread.java:745) > > at > org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:106) > at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1325) > at > 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:126) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:349) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) > at
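The failure above reduces to the caching pattern behind getOrCreate: a class-level singleton that keeps handing back a context bound to a SparkContext that has since been stopped. The sketch below uses hypothetical stand-in classes (FakeSparkContext, FakeSQLContext — not Spark's real API) to illustrate the pattern and the obvious fix: the cache check must also test whether the cached instance's context was stopped, not just whether a cached instance exists.

```python
# Hypothetical stand-ins, not Spark's actual classes, illustrating the
# getOrCreate caching pattern from SPARK-18786.

class FakeSparkContext:
    def __init__(self):
        self._stopped = False

    def stop(self):
        self._stopped = True


class FakeSQLContext:
    _instance = None  # class-level singleton cache

    def __init__(self, sc):
        self.sc = sc

    @classmethod
    def get_or_create(cls, sc):
        # A naive cache would check only `cls._instance is None` and keep
        # returning a context bound to a stopped SparkContext (the bug
        # reported above). The extra `_stopped` test rebinds the cache to
        # the live context instead.
        if cls._instance is None or cls._instance.sc._stopped:
            cls._instance = cls(sc)
        return cls._instance


sc1 = FakeSparkContext()
ctx1 = FakeSQLContext.get_or_create(sc1)
sc1.stop()

sc2 = FakeSparkContext()
ctx2 = FakeSQLContext.get_or_create(sc2)
assert ctx2.sc is sc2          # rebound to the live context
assert not ctx2.sc._stopped
```

Without the `_stopped` test, `ctx2.sc` would still be the stopped `sc1`, which is exactly the state that triggers the IllegalStateException in the quoted trace.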
[jira] [Updated] (SPARK-18710) Add offset to GeneralizedLinearRegression models
[ https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang updated SPARK-18710: Shepherd: Yanbo Liang (was: Sean Owen) Remaining Estimate: 10h (was: 336h) Original Estimate: 10h (was: 336h) > Add offset to GeneralizedLinearRegression models > > > Key: SPARK-18710 > URL: https://issues.apache.org/jira/browse/SPARK-18710 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang > Labels: features > Fix For: 2.2.0 > > Original Estimate: 10h > Remaining Estimate: 10h > > The current GeneralizedLinearRegression model does not support an offset. An offset is useful for taking exposure into account, or for testing the incremental effect of new variables. While weights in the current implementation can achieve the same effect as an offset for certain models (e.g., Poisson and Binomial with a log offset), a dedicated offset option would cover more general cases, e.g., a negative offset, or an offset that is hard to express through weights (such as an offset to the probability rather than the odds in logistic regression). > The effort would involve: > * updating the regression class to support offsetCol > * updating IWLS to take the offset into account > * adding test cases for offset > I can start working on this if the community approves this feature. >
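To make the exposure use case concrete: in a GLM with a log link, the offset enters the linear predictor directly, eta = x . beta + offset, with mean mu = exp(eta). Setting offset = log(exposure) turns the model into a rate model. The pure-Python sketch below (not Spark's API; `poisson_mean` is a made-up helper) shows that a log-exposure offset scales the expected count multiplicatively by the exposure:

```python
import math

def poisson_mean(x, beta, offset=0.0):
    """Mean of a Poisson GLM with log link: mu = exp(x . beta + offset)."""
    eta = sum(xi * bi for xi, bi in zip(x, beta)) + offset
    return math.exp(eta)

x = [1.0, 2.0]       # intercept term plus one covariate
beta = [0.5, -0.25]  # coefficients
exposure = 10.0

mu_unit = poisson_mean(x, beta)                        # exposure = 1
mu_scaled = poisson_mean(x, beta, math.log(exposure))  # exposure = 10

# exp(x.beta + log(e)) = e * exp(x.beta): the offset multiplies the
# mean by the exposure while beta keeps its per-unit interpretation.
assert math.isclose(mu_scaled, exposure * mu_unit)
```

This also shows why weights are not a full substitute: a weight rescales the likelihood contribution, whereas an offset shifts the linear predictor itself, which is what cases like a negative offset require.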