[jira] [Created] (SPARK-12302) Example for servlet filter used by spark.ui.filters
Kai Sasaki created SPARK-12302: -- Summary: Example for servlet filter used by spark.ui.filters Key: SPARK-12302 URL: https://issues.apache.org/jira/browse/SPARK-12302 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.5.2 Reporter: Kai Sasaki Priority: Trivial Although the {{spark.ui.filters}} configuration takes a simple servlet filter, it is often difficult to understand how to write the filter code and how to integrate it with an actual Spark application. It would be helpful to provide examples for trying out a secure Spark cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
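For reference, a minimal sketch of the kind of example this asks for, written in Scala against the standard javax.servlet API (the class name and the response header are illustrative assumptions, not existing Spark code):
{code}
import javax.servlet.{Filter, FilterChain, FilterConfig, ServletRequest, ServletResponse}
import javax.servlet.http.HttpServletResponse

// A trivial filter that stamps every Spark UI response with a marker header.
class ExampleUIFilter extends Filter {
  override def init(config: FilterConfig): Unit = ()

  override def doFilter(req: ServletRequest, res: ServletResponse, chain: FilterChain): Unit = {
    res.asInstanceOf[HttpServletResponse].setHeader("X-Example-Filter", "applied")
    chain.doFilter(req, res) // continue with the rest of the filter chain
  }

  override def destroy(): Unit = ()
}
{code}
The filter would then be enabled with something like {{--conf spark.ui.filters=com.example.ExampleUIFilter}}, with the compiled class placed on the driver's classpath.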
[jira] [Commented] (SPARK-11938) Expose numFeatures in all ML PredictionModel for PySpark
[ https://issues.apache.org/jira/browse/SPARK-11938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044673#comment-15044673 ] Kai Sasaki commented on SPARK-11938: [~mengxr] [~yanboliang] Sorry for bothering you, but could you review this if possible? Thank you. > Expose numFeatures in all ML PredictionModel for PySpark > > > Key: SPARK-11938 > URL: https://issues.apache.org/jira/browse/SPARK-11938 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > SPARK-9715 provided support for numFeatures in all ML PredictionModel; we > should expose it on the Python side. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11520) RegressionMetrics should support instance weights
[ https://issues.apache.org/jira/browse/SPARK-11520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035194#comment-15035194 ] Kai Sasaki commented on SPARK-11520: [~mengxr] [~josephkb] Could you review this if possible? > RegressionMetrics should support instance weights > - > > Key: SPARK-11520 > URL: https://issues.apache.org/jira/browse/SPARK-11520 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This will be important to improve LinearRegressionSummary, which currently > has a mix of weighted and unweighted metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11520) RegressionMetrics should support instance weights
[ https://issues.apache.org/jira/browse/SPARK-11520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012964#comment-15012964 ] Kai Sasaki commented on SPARK-11520: [~josephkb] The metrics in {{RegressionMetrics}} seem to be based on {{MultivariateStatisticalSummary}}, and the current {{RegressionMetrics}} does not accept weighted samples as an argument. So we could pass the weighted samples to {{MultivariateStatisticalSummary}} ({{MultivariateOnlineSummarizer}}) and calculate the regression metrics from there, as sketched below. Is this assumption correct? Can I work on this JIRA, if possible? > RegressionMetrics should support instance weights > - > > Key: SPARK-11520 > URL: https://issues.apache.org/jira/browse/SPARK-11520 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This will be important to improve LinearRegressionSummary, which currently > has a mix of weighted and unweighted metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
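A rough Scala sketch of that idea — note the weighted {{add(instance, weight)}} overload is the proposed extension here, not necessarily the current public API:
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.rdd.RDD

// (prediction, label, weight) triples: the weight flows into the summarizer,
// which would then back weighted regression metrics.
def weightedSummary(data: RDD[(Double, Double, Double)]): MultivariateOnlineSummarizer =
  data.aggregate(new MultivariateOnlineSummarizer())(
    (summary, v) => summary.add(Vectors.dense(v._1, v._2), v._3), // proposed weighted add
    (s1, s2) => s1.merge(s2))
{code}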
[jira] [Comment Edited] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15013068#comment-15013068 ] Kai Sasaki edited comment on SPARK-4036 at 11/19/15 7:32 AM: - [~hujiayin] I'm sorry for the late response. I haven't created any patch yet, so please feel free to work on this JIRA instead of me. Anyway, may I review and comment on your patch? was (Author: lewuathe): [~hujiayin] I'm sorry for the late response. I haven't created any patch yet, so please feel free to work in this JIRA instead of me. Anyway, may I review and comment on your patch? > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Kai Sasaki > Attachments: CRF_design.1.pdf > > > Conditional random fields (CRFs) are a class of statistical modelling methods > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15013068#comment-15013068 ] Kai Sasaki commented on SPARK-4036: --- [~hujiayin] I'm sorry for the late response. I haven't created any patch yet, so please feel free to work in this JIRA instead of me. Anyway, may I review and comment on your patch? > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Kai Sasaki > Attachments: CRF_design.1.pdf > > > Conditional random fields (CRFs) are a class of statistical modelling methods > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003384#comment-15003384 ] Kai Sasaki edited comment on SPARK-11439 at 11/13/15 1:48 AM: -- [~nakul02] It seems this refers to the SparkR model here, not glmnet. According to this documentation, you can create a SparkR linear model with the {{glm}} function. https://spark.apache.org/docs/latest/sparkr.html#machine-learning This will call {{SparkRWrapper#fitRModelFormula}}. It returns a LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the second parameter. In summary, we can write code like this to use {{LinearRegressionModel}} in SparkR. {code} df <- createDataFrame(sqlContext, iris) fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian") summary(fit) {code} In my environment, it seems to work. was (Author: lewuathe): [~nakul02] It seems this refers to the SparkR model here, not glmnet. According to this documentation, you can create a SparkR linear model with the {{glm}} function. https://spark.apache.org/docs/latest/sparkr.html#machine-learning This will call {{SparkRWrapper#fitRModelFormula}. It returns a LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the second parameter. In summary, we can write code like this to use {{LinearRegressionModel}} in SparkR. {code} df <- createDataFrame(sqlContext, iris) fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian") summary(fit) {code} In my environment, it seems to work. > Optimization of creating sparse feature without dense one > - > > Key: SPARK-11439 > URL: https://issues.apache.org/jira/browse/SPARK-11439 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Kai Sasaki >Priority: Minor > > Currently, generating a sparse feature in {{LinearDataGenerator}} requires > creating dense vectors first. It would be more efficient to avoid generating > the dense feature when creating sparse features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003384#comment-15003384 ] Kai Sasaki edited comment on SPARK-11439 at 11/13/15 1:48 AM: -- [~nakul02] It seems this refers to the SparkR model here, not glmnet. According to this documentation, you can create a SparkR linear model with the {{glm}} function. https://spark.apache.org/docs/latest/sparkr.html#machine-learning This will call {{SparkRWrapper#fitRModelFormula}. It returns a LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the second parameter. In summary, we can write code like this to use {{LinearRegressionModel}} in SparkR. {code} df <- createDataFrame(sqlContext, iris) fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian") summary(fit) {code} In my environment, it seems to work. was (Author: lewuathe): [~nakul02] It seems this refers to the SparkR model here, not glmnet. According to this documentation, you can create a SparkR linear model with `glm`. https://spark.apache.org/docs/latest/sparkr.html#machine-learning This will call {{SparkRWrapper#fitRModelFormula}. It returns a LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the second parameter. In summary, we can write code like this to use {{LinearRegressionModel}} in SparkR. {code} df <- createDataFrame(sqlContext, iris) fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian") summary(fit) {code} In my environment, it seems to work. > Optimization of creating sparse feature without dense one > - > > Key: SPARK-11439 > URL: https://issues.apache.org/jira/browse/SPARK-11439 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Kai Sasaki >Priority: Minor > > Currently, generating a sparse feature in {{LinearDataGenerator}} requires > creating dense vectors first. It would be more efficient to avoid generating > the dense feature when creating sparse features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003384#comment-15003384 ] Kai Sasaki commented on SPARK-11439: [~nakul02] It seems this refers to the SparkR model here. According to this documentation, you can create a SparkR linear model with `glm`. https://spark.apache.org/docs/latest/sparkr.html#machine-learning This will call {{SparkRWrapper#fitRModelFormula}. It returns a LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the second parameter. In summary, we can write code like this to use {{LinearRegressionModel}} in SparkR. {code} df <- createDataFrame(sqlContext, iris) fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian") summary(fit) {code} In my environment, it seems to work. > Optimization of creating sparse feature without dense one > - > > Key: SPARK-11439 > URL: https://issues.apache.org/jira/browse/SPARK-11439 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Kai Sasaki >Priority: Minor > > Currently, generating a sparse feature in {{LinearDataGenerator}} requires > creating dense vectors first. It would be more efficient to avoid generating > the dense feature when creating sparse features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003384#comment-15003384 ] Kai Sasaki edited comment on SPARK-11439 at 11/13/15 1:47 AM: -- [~nakul02] It seems this refers to the SparkR model here, not glmnet. According to this documentation, you can create a SparkR linear model with `glm`. https://spark.apache.org/docs/latest/sparkr.html#machine-learning This will call {{SparkRWrapper#fitRModelFormula}. It returns a LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the second parameter. In summary, we can write code like this to use {{LinearRegressionModel}} in SparkR. {code} df <- createDataFrame(sqlContext, iris) fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian") summary(fit) {code} In my environment, it seems to work. was (Author: lewuathe): [~nakul02] It seems this refers to the SparkR model here. According to this documentation, you can create a SparkR linear model with `glm`. https://spark.apache.org/docs/latest/sparkr.html#machine-learning This will call {{SparkRWrapper#fitRModelFormula}. It returns a LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the second parameter. In summary, we can write code like this to use {{LinearRegressionModel}} in SparkR. {code} df <- createDataFrame(sqlContext, iris) fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian") summary(fit) {code} In my environment, it seems to work. > Optimization of creating sparse feature without dense one > - > > Key: SPARK-11439 > URL: https://issues.apache.org/jira/browse/SPARK-11439 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Kai Sasaki >Priority: Minor > > Currently, generating a sparse feature in {{LinearDataGenerator}} requires > creating dense vectors first. It would be more efficient to avoid generating > the dense feature when creating sparse features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003384#comment-15003384 ] Kai Sasaki edited comment on SPARK-11439 at 11/13/15 1:49 AM: -- [~nakul02] It seems this refers to the SparkR model here, not glmnet. According to this documentation, you can create a SparkR linear model with the {{glm}} function. https://spark.apache.org/docs/latest/sparkr.html#machine-learning This will call {{SparkRWrapper#fitRModelFormula}}. It returns a LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the second argument. In summary, we can write code like this to use {{LinearRegressionModel}} in SparkR. {code} df <- createDataFrame(sqlContext, iris) fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian") summary(fit) {code} In my environment, it seems to work. was (Author: lewuathe): [~nakul02] It seems this refers to the SparkR model here, not glmnet. According to this documentation, you can create a SparkR linear model with the {{glm}} function. https://spark.apache.org/docs/latest/sparkr.html#machine-learning This will call {{SparkRWrapper#fitRModelFormula}}. It returns a LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the second parameter. In summary, we can write code like this to use {{LinearRegressionModel}} in SparkR. {code} df <- createDataFrame(sqlContext, iris) fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian") summary(fit) {code} In my environment, it seems to work. > Optimization of creating sparse feature without dense one > - > > Key: SPARK-11439 > URL: https://issues.apache.org/jira/browse/SPARK-11439 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Kai Sasaki >Priority: Minor > > Currently, generating a sparse feature in {{LinearDataGenerator}} requires > creating dense vectors first. It would be more efficient to avoid generating > the dense feature when creating sparse features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003384#comment-15003384 ] Kai Sasaki edited comment on SPARK-11439 at 11/13/15 1:50 AM: -- [~nakul02] It seems this refers to the SparkR model here, not glmnet. According to this documentation, you can create a SparkR linear model with the {{glm}} function. https://spark.apache.org/docs/latest/sparkr.html#machine-learning This will call {{SparkRWrapper#fitRModelFormula}}. It returns a LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the second argument. In summary, we can write code like this to use {{LinearRegressionModel}} in SparkR. {code} df <- createDataFrame(sqlContext, iris) # You should replace this with generated data fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian") summary(fit) $devianceResiduals Min Max -1.307112 1.412532 $coefficients Estimate Std. Error t value Pr(>|t|) (Intercept)2.251393 0.3697543 6.08889 9.568102e-09 Sepal_Width0.8035609 0.106339 7.556598 4.187317e-12 Species_versicolor 1.458743 0.1121079 13.01195 0 Species_virginica 1.946817 0.100015 19.46525 0 {code} In my environment, it seems to work. was (Author: lewuathe): [~nakul02] It seems this refers to the SparkR model here, not glmnet. According to this documentation, you can create a SparkR linear model with the {{glm}} function. https://spark.apache.org/docs/latest/sparkr.html#machine-learning This will call {{SparkRWrapper#fitRModelFormula}}. It returns a LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the second argument. In summary, we can write code like this to use {{LinearRegressionModel}} in SparkR. {code} df <- createDataFrame(sqlContext, iris) fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian") summary(fit) {code} In my environment, it seems to work. > Optimization of creating sparse feature without dense one > - > > Key: SPARK-11439 > URL: https://issues.apache.org/jira/browse/SPARK-11439 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Kai Sasaki >Priority: Minor > > Currently, generating a sparse feature in {{LinearDataGenerator}} requires > creating dense vectors first. It would be more efficient to avoid generating > the dense feature when creating sparse features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003384#comment-15003384 ] Kai Sasaki edited comment on SPARK-11439 at 11/13/15 1:51 AM: -- [~nakul02] It seems this refers to the SparkR model here, not glmnet. According to this documentation, you can create a SparkR linear model with the {{glm}} function. https://spark.apache.org/docs/latest/sparkr.html#machine-learning This will call {{SparkRWrapper#fitRModelFormula}}. It returns a LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the second argument. In summary, we can write code like this to use {{LinearRegressionModel}} in SparkR. {code} df <- createDataFrame(sqlContext, iris) # You should replace this with generated data fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian") summary(fit) $devianceResiduals Min Max -1.307112 1.412532 $coefficients Estimate Std. Error t value Pr(>|t|) (Intercept)2.251393 0.3697543 6.08889 9.568102e-09 Sepal_Width0.8035609 0.106339 7.556598 4.187317e-12 Species_versicolor 1.458743 0.1121079 13.01195 0 Species_virginica 1.946817 0.100015 19.46525 0 {code} In my environment (running on bin/sparkR), it seems to work. was (Author: lewuathe): [~nakul02] It seems this refers to the SparkR model here, not glmnet. According to this documentation, you can create a SparkR linear model with the {{glm}} function. https://spark.apache.org/docs/latest/sparkr.html#machine-learning This will call {{SparkRWrapper#fitRModelFormula}}. It returns a LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the second argument. In summary, we can write code like this to use {{LinearRegressionModel}} in SparkR. {code} df <- createDataFrame(sqlContext, iris) # You should replace this with generated data fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian") summary(fit) $devianceResiduals Min Max -1.307112 1.412532 $coefficients Estimate Std. Error t value Pr(>|t|) (Intercept)2.251393 0.3697543 6.08889 9.568102e-09 Sepal_Width0.8035609 0.106339 7.556598 4.187317e-12 Species_versicolor 1.458743 0.1121079 13.01195 0 Species_virginica 1.946817 0.100015 19.46525 0 {code} In my environment, it seems to work. > Optimization of creating sparse feature without dense one > - > > Key: SPARK-11439 > URL: https://issues.apache.org/jira/browse/SPARK-11439 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Kai Sasaki >Priority: Minor > > Currently, generating a sparse feature in {{LinearDataGenerator}} requires > creating dense vectors first. It would be more efficient to avoid generating > the dense feature when creating sparse features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11717) Ignore R session and history files from git
Kai Sasaki created SPARK-11717: -- Summary: Ignore R session and history files from git Key: SPARK-11717 URL: https://issues.apache.org/jira/browse/SPARK-11717 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Kai Sasaki Priority: Trivial SparkR generates R session data and history files under the current directory. It would be useful to ignore these files even when running SparkR from the Spark directory for testing or development. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
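A sketch of the {{.gitignore}} entries this likely means, assuming the files in question are the standard R session artifacts:
{code}
# R console history and session data left behind by interactive SparkR runs
.Rhistory
.RData
{code}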
[jira] [Created] (SPARK-11439) Optimization of creating sparse feature without dense one
Kai Sasaki created SPARK-11439: -- Summary: Optimization of creating sparse feature without dense one Key: SPARK-11439 URL: https://issues.apache.org/jira/browse/SPARK-11439 Project: Spark Issue Type: Improvement Components: ML Reporter: Kai Sasaki Priority: Minor Currently, generating a sparse feature in {{LinearDataGenerator}} requires creating dense vectors first. It would be more efficient to generate sparse features without creating dense vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Sasaki updated SPARK-11439: --- Description: Currently, generating a sparse feature in {{LinearDataGenerator}} requires creating dense vectors first. It would be more efficient to avoid generating the dense feature when creating sparse features. (was: Currently, generating a sparse feature in {{LinearDataGenerator}} requires creating dense vectors first. It would be more efficient to generate sparse features without creating dense vectors.) > Optimization of creating sparse feature without dense one > -- > > Key: SPARK-11439 > URL: https://issues.apache.org/jira/browse/SPARK-11439 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Kai Sasaki >Priority: Minor > > Currently, generating a sparse feature in {{LinearDataGenerator}} requires > creating dense vectors first. It would be more efficient to avoid generating > the dense feature when creating sparse features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
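A minimal Scala sketch of the intended optimization — building the sparse vector directly from sampled non-zero positions instead of densifying first (the names and sampling scheme are illustrative, not the actual {{LinearDataGenerator}} code):
{code}
import scala.util.Random
import org.apache.spark.mllib.linalg.{Vector, Vectors}

def sparseFeature(numFeatures: Int, density: Double, rnd: Random): Vector = {
  // Sample only the non-zero positions; no dense intermediate array is built.
  val indices = (0 until numFeatures).filter(_ => rnd.nextDouble() < density).toArray
  val values = indices.map(_ => rnd.nextGaussian())
  Vectors.sparse(numFeatures, indices, values)
}
{code}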
[jira] [Commented] (SPARK-11223) PySpark CrossValidatorModel does not output metrics for every param in paramGrid
[ https://issues.apache.org/jira/browse/SPARK-11223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981720#comment-14981720 ] Kai Sasaki commented on SPARK-11223: Yes, I think so too. But I wonder whether the purpose of this is only debugging. If so, simply printing the parameters with their metrics might be sufficient; see the sketch below. > PySpark CrossValidatorModel does not output metrics for every param in > paramGrid > > > Key: SPARK-11223 > URL: https://issues.apache.org/jira/browse/SPARK-11223 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Raela Wang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
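On the Scala side this pairing is already reachable; a hedged sketch of what printing parameters with metrics could look like (PySpark would need the equivalent exposed, which is what this ticket tracks):
{code}
// cvModel is a fitted org.apache.spark.ml.tuning.CrossValidatorModel;
// avgMetrics lines up index-by-index with the estimator param maps.
cvModel.getEstimatorParamMaps.zip(cvModel.avgMetrics).foreach {
  case (paramMap, metric) => println(s"$paramMap => $metric")
}
{code}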
[jira] [Commented] (SPARK-11239) PMML export for ML linear regression
[ https://issues.apache.org/jira/browse/SPARK-11239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973214#comment-14973214 ] Kai Sasaki commented on SPARK-11239: [~holdenk] Hi, are the tickets under SPARK-11171 blocked by SPARK-11241? > PMML export for ML linear regression > > > Key: SPARK-11239 > URL: https://issues.apache.org/jira/browse/SPARK-11239 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: holdenk > > Add PMML export for linear regression models from the ML pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11234) What's cooking classification
[ https://issues.apache.org/jira/browse/SPARK-11234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973221#comment-14973221 ] Kai Sasaki commented on SPARK-11234: [~xusen] Thank you so much for the very insightful experiments! {quote} 4. The evaluator forces me to select a metric method. But sometimes I want to see all the evaluation results, say F1, precision-recall, AUC, etc. {quote} Yes, I agree with you. In the initial phase of running a machine learning algorithm, we often don't yet know which metrics we should look at. {quote} 5. ML transformers will get stuck when facing with Int type. It's strange that we have to transform all Int values to double values before hand. I think a wise auto casting is helpful. {quote} Which kind of Transformer got stuck? Do you mean that the first transformer cannot handle Int input values? A manual cast, as sketched below, seems to be the current workaround. > What's cooking classification > - > > Key: SPARK-11234 > URL: https://issues.apache.org/jira/browse/SPARK-11234 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xusen Yin > > I added the subtask to post the work on this dataset: > https://www.kaggle.com/c/whats-cooking -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
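Regarding the Int-typed columns, a small Scala sketch of the manual workaround being discussed — casting before the pipeline (the column name is illustrative):
{code}
// Cast an integer column to double so that ML transformers accept it.
val prepared = df.withColumn("age", df("age").cast("double"))
{code}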
[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?
[ https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973745#comment-14973745 ] Kai Sasaki commented on SPARK-7146: --- There have been several times when I wanted to use internal resources (e.g. shared params, optimization) of Spark in our own library or framework. Having to write this code again often causes trouble and long development time. In addition, as you said, there might be several implementations which have the same name but different functionality. {quote} Cons: Users have to be careful since parameters can have different meanings for different algorithms. {quote} I think this is true even while {{sharedParams}} is private, because application developers will implement their own params which keep almost the same names as {{sharedParams}}, which becomes confusing. So basically it might be better to enable developers to use {{sharedParams}} inside their own frameworks; that does not necessarily mean making it public directly. As [~josephkb] proposed in (b), a good way is to open it up for developers but with some restrictions. > Should ML sharedParams be a public API? > --- > > Key: SPARK-7146 > URL: https://issues.apache.org/jira/browse/SPARK-7146 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley > > Discussion: Should the Param traits in sharedParams.scala be public? > Pros: > * Sharing the Param traits helps to encourage standardized Param names and > documentation. > Cons: > * Users have to be careful since parameters can have different meanings for > different algorithms. > * If the shared Params are public, then implementations could test for the > traits. It is unclear if we want users to rely on these traits, which are > somewhat experimental. > Currently, the shared params are private. > Proposal: Either > (a) make the shared params private to encourage users to write specialized > documentation and value checks for parameters, or > (b) design a better way to encourage overriding documentation and parameter > value checks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11207) Add test cases for normal LinearRegression solver as followup.
Kai Sasaki created SPARK-11207: -- Summary: Add test cases for normal LinearRegression solver as followup. Key: SPARK-11207 URL: https://issues.apache.org/jira/browse/SPARK-11207 Project: Spark Issue Type: Improvement Components: ML Reporter: Kai Sasaki This is the follow-up work of SPARK-10668. * Fix minor style issues. * Add test cases for checking whether the solver is selected properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
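A hedged sketch of the kind of test case meant here, assuming the {{solver}} param introduced by SPARK-10668 accepts "auto", "normal", and "l-bfgs":
{code}
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
assert(lr.getSolver == "auto")   // default: let the implementation choose
lr.setSolver("normal")
assert(lr.getSolver == "normal") // an explicitly selected solver sticks
{code}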
[jira] [Updated] (SPARK-11207) Add test cases for solver selection of LinearRegression as followup.
[ https://issues.apache.org/jira/browse/SPARK-11207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Sasaki updated SPARK-11207: --- Summary: Add test cases for solver selection of LinearRegression as followup. (was: Add test cases for normal LinearRegression solver as followup.) > Add test cases for solver selection of LinearRegression as followup. > > > Key: SPARK-11207 > URL: https://issues.apache.org/jira/browse/SPARK-11207 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Kai Sasaki > Labels: ML > > This is the follow-up work of SPARK-10668. > * Fix minor style issues. > * Add test cases for checking whether the solver is selected properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10668) Use WeightedLeastSquares in LinearRegression with L2 regularization if the number of features is small
[ https://issues.apache.org/jira/browse/SPARK-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904395#comment-14904395 ] Kai Sasaki commented on SPARK-10668: Sorry for being late submitting the patch, and thank you for supporting me. [~mengxr] [~yanbo] Could you review the current patch? > Use WeightedLeastSquares in LinearRegression with L2 regularization if the > number of features is small > -- > > Key: SPARK-10668 > URL: https://issues.apache.org/jira/browse/SPARK-10668 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Kai Sasaki >Priority: Critical > > If the number of features is small (<=4096) and the regularization is L2, we > should use WeightedLeastSquares to solve the problem rather than L-BFGS. The > former requires only one pass over the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
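The selection rule in the description boils down to a simple predicate; a sketch (the param names follow the description, and the exact constant in the implementation may differ):
{code}
// Use the one-pass normal-equation solver only for pure-L2 problems
// (elasticNetParam == 0.0) with a small feature count; otherwise L-BFGS.
def useWeightedLeastSquares(solver: String, elasticNetParam: Double, numFeatures: Int): Boolean =
  solver == "auto" && elasticNetParam == 0.0 && numFeatures <= 4096
{code}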
[jira] [Updated] (SPARK-10715) Duplicate initialization flag in WeightedLeastSquares
[ https://issues.apache.org/jira/browse/SPARK-10715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Sasaki updated SPARK-10715: --- Labels: ML (was: ) > Duplicate initialization flag in WeightedLeastSquares > --- > > Key: SPARK-10715 > URL: https://issues.apache.org/jira/browse/SPARK-10715 > Project: Spark > Issue Type: Bug >Reporter: Kai Sasaki >Priority: Trivial > Labels: ML > > There is a duplicate setting of the initialization flag in > {{WeightedLeastSquares#add}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10715) Duplicate initialization flag in WeightedLeastSquares
Kai Sasaki created SPARK-10715: -- Summary: Duplicate initialization flag in WeightedLeastSquares Key: SPARK-10715 URL: https://issues.apache.org/jira/browse/SPARK-10715 Project: Spark Issue Type: Bug Reporter: Kai Sasaki Priority: Trivial There is a duplicate setting of the initialization flag in {{WeightedLeastSquares#add}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
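A minimal, self-contained illustration of the reported pattern (this is not the actual {{WeightedLeastSquares}} source; the names and structure are assumptions):
{code}
class Aggregator {
  private var initialized = false

  private def init(): Unit = {
    // ... allocate accumulation buffers ...
    initialized = true   // the flag is set here
  }

  def add(x: Double): this.type = {
    if (!initialized) {
      init()
      initialized = true // ... and set again here: the duplicate to remove
    }
    // ... accumulate x ...
    this
  }
}
{code}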
[jira] [Commented] (SPARK-10709) When loading a json dataset as a data frame, if the input path is wrong, the error message is very confusing
[ https://issues.apache.org/jira/browse/SPARK-10709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876767#comment-14876767 ] Kai Sasaki commented on SPARK-10709: [~yhuai] So do you mean this error message should clarify the difference between that the path does not exist and that path is not passed to `path` parameter? > When loading a json dataset as a data frame, if the input path is wrong, the > error message is very confusing > > > Key: SPARK-10709 > URL: https://issues.apache.org/jira/browse/SPARK-10709 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > If you do something like {{sqlContext.read.json("a wrong path")}}, when we > actually read data, the error message is > {code} > java.io.IOException: No input paths specified in job > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:198) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at 
scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.ShuffleDependency.(Dependency.scala:91) > at > org.apache.spark.sql.execution.ShuffledRowRDD.getDependencies(ShuffledRowRDD.scala:59) > at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:226) > at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:224) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.dependencies(RDD.scala:224) > at >
[jira] [Commented] (SPARK-10668) Use WeightedLeastSquares in LinearRegression with L2 regularization if the number of features is small
[ https://issues.apache.org/jira/browse/SPARK-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876787#comment-14876787 ] Kai Sasaki commented on SPARK-10668: [~mengxr] Hello, can I work on this JIRA? Please assign it to me. Thank you. > Use WeightedLeastSquares in LinearRegression with L2 regularization if the > number of features is small > -- > > Key: SPARK-10668 > URL: https://issues.apache.org/jira/browse/SPARK-10668 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > > If the number of features is small (<=4096) and the regularization is L2, we > should use WeightedLeastSquares to solve the problem rather than L-BFGS. The > former requires only one pass over the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10388) Public dataset loader interface
[ https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14802747#comment-14802747 ] Kai Sasaki commented on SPARK-10388: [~mengxr] I totally agree with you: the initial version should be minimal and simple, so the previous suggestion just lists desired features. In this sense, the initial suggestion might be sufficient as an MVP. {quote} For example, I don't think json and orc are commonly used for ML datasets. {quote} Yes, JSON and ORC are not used for machine learning data. I just think the public dataset loader should be flexible for later extension; that is, other dataset formats could be added as plugins. {quote} A proper implementation would be implementing HTTP as a Hadoop FileSystem. {quote} Does that mean a public dataset could be used through an RDD directly? For example, could we use {{val data = sc.textFile(<public dataset url>)}}? > Public dataset loader interface > --- > > Key: SPARK-10388 > URL: https://issues.apache.org/jira/browse/SPARK-10388 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > It is very useful to have a public dataset loader to fetch ML datasets from > popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, > requirements, and initial implementation. > {code} > val loader = new DatasetLoader(sqlContext) > val df = loader.get("libsvm", "rcv1_train.binary") > {code} > User should be able to list (or preview) datasets, e.g. > {code} > val datasets = loader.ls("libsvm") // returns a local DataFrame > datasets.show() // list all datasets under libsvm repo > {code} > It would be nice to allow 3rd-party packages to register new repos. Both the > API and implementation are pending discussion. Note that this requires http > and https support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10388) Public dataset loader interface
[ https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745430#comment-14745430 ] Kai Sasaki commented on SPARK-10388: This seems very useful for beginners who want to try Spark ML in their projects and who want to see the behaviour of the Pipeline API. I have several comments. * It might be better to do lazy downloads. Some datasets are very large, so it would be good to download them only when they are really needed. In the above example, datasets would be downloaded at {{datasets.show()}}. * Once datasets are downloaded, it would be better to cache the data locally. This requires a repository API that publishes the latest updates, so that the public dataset loader can refresh its local cache properly. * I agree with the idea of allowing 3rd parties to create their own repositories. This requires fixing the design of the repository itself; we could create a specification and also an SDK if possible. (Should these be included in the Spark project?) * We should not restrict the formats which the public dataset loader can load. The current {{DataFrameReader}} can read formats such as JSON, libsvm, or ORC, and there might be various kinds of formats in public repositories, so it may be reasonable to also support currently unsupported formats in the future. * Although this is a bit of a whim, integration between the public dataset loader and Kaggle datasets would increase the use cases of Spark ML. In general, finding and loading data is troublesome; this feature would make it easier for developers. I want to help with this design and implementation. Thank you. > Public dataset loader interface > --- > > Key: SPARK-10388 > URL: https://issues.apache.org/jira/browse/SPARK-10388 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > It is very useful to have a public dataset loader to fetch ML datasets from > popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, > requirements, and initial implementation. > {code} > val loader = new DatasetLoader(sqlContext) > val df = loader.get("libsvm", "rcv1_train.binary") > {code} > User should be able to list (or preview) datasets, e.g. > {code} > val datasets = loader.ls("libsvm") // returns a local DataFrame > datasets.show() // list all datasets under libsvm repo > {code} > It would be nice to allow 3rd-party packages to register new repos. Both the > API and implementation are pending discussion. Note that this requires http > and https support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10055) San Francisco Crime Classification
[ https://issues.apache.org/jira/browse/SPARK-10055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14711298#comment-14711298 ] Kai Sasaki commented on SPARK-10055: I submitted the initial version for this competition. Although the score is not good, there are several points I found while using the Spark ML API. Some of this might just be caused by my lack of knowledge of Spark ML, so if these can already be solved with existing code, please let me know. * There does not seem to be a {{Transformer}} which can cast the type of a column. In this case, {{X}} and {{Y}} are String by default when read by [spark-csv|http://spark-packages.org/package/databricks/spark-csv]. In order to apply {{StandardScaler}} to {{X}} and {{Y}}, they must be numeric types, and I cannot do that cast with a Spark ML {{Transformer}}. Fortunately, {{spark-csv}} can infer the schema by reading all the data once, but in case the reading library has no such option, I think it would be better to be able to cast column types in a Spark ML pipeline. * {{StringIndexer}} exports its labels ordered by frequency, but in this competition we have to write them in alphabetical order, so some extra code is needed to convert the frequency-ordered labels to alphabetical order. * {{StandardScaler}} can only receive vector data as its input. In this case, I want to scale {{X}} and {{Y}} with {{StandardScaler}}, but since these are simple double values, it is necessary to assemble them into a feature vector first; see the sketch below. Is there a way to apply {{StandardScaler}} to simple Int or Double data, or do we have to assemble the data into a feature vector before scaling? The code is [here|https://github.com/Lewuathe/kaggle-jobs/blob/master/src/main/scala/com/lewuathe/SfCrimeClassification.scala]. Thank you. San Francisco Crime Classification -- Key: SPARK-10055 URL: https://issues.apache.org/jira/browse/SPARK-10055 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Xusen Yin Apply ML pipeline API to San Francisco Crime Classification (https://www.kaggle.com/c/sf-crime). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
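For the third point, a hedged Scala sketch of the assemble-then-scale workaround (the column names are illustrative):
{code}
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

// Pack the plain Double columns X and Y into one vector column first,
// since StandardScaler only accepts vector input.
val assembler = new VectorAssembler()
  .setInputCols(Array("X", "Y"))
  .setOutputCol("xy")
val scaler = new StandardScaler()
  .setInputCol("xy")
  .setOutputCol("xyScaled")

val assembled = assembler.transform(df)
val scaled = scaler.fit(assembled).transform(assembled)
{code}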
[jira] [Commented] (SPARK-10117) Implement SQL data source API for reading LIBSVM data
[ https://issues.apache.org/jira/browse/SPARK-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14711186#comment-14711186 ] Kai Sasaki commented on SPARK-10117: [~mengxr] If possible, can I work on this JIRA? Thank you! Implement SQL data source API for reading LIBSVM data - Key: SPARK-10117 URL: https://issues.apache.org/jira/browse/SPARK-10117 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng It is convenient to implement data source API for LIBSVM format to have a better integration with DataFrames and ML pipeline API. {code} import org.apache.spark.ml.source.libsvm._ val training = sqlContext.read .format("libsvm") .option("numFeatures", "1") .load(path) {code} This JIRA covers the following: 1. Read LIBSVM data as a DataFrame with two columns: label: Double and features: Vector. 2. Accept `numFeatures` as an option. 3. The implementation should live under `org.apache.spark.ml.source.libsvm`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10110) StringIndexer lacks the parameter handleInvalid.
Kai Sasaki created SPARK-10110: -- Summary: StringIndexer lacks the parameter handleInvalid. Key: SPARK-10110 URL: https://issues.apache.org/jira/browse/SPARK-10110 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Kai Sasaki Fix For: 1.5.0 Missing API for PySpark: {{StringIndexer.handleInvalid}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
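For reference, a sketch of the Scala side that the missing PySpark API would mirror ("skip" drops rows with unseen labels; column names are illustrative):
{code}
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("skip") // the setter missing from PySpark
{code}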
[jira] [Created] (SPARK-10111) StringIndexerModel lacks the method labels
Kai Sasaki created SPARK-10111: -- Summary: StringIndexerModel lacks the method labels Key: SPARK-10111 URL: https://issues.apache.org/jira/browse/SPARK-10111 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Kai Sasaki Missing {{labels}} property of {{StringIndexer}} in PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10111) StringIndexerModel lacks the method labels
[ https://issues.apache.org/jira/browse/SPARK-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Sasaki updated SPARK-10111: --- Description: Missing {{labels}} property of {{StringIndexerModel}} in PySpark. (was: Missing {{labels}} property of {{StringIndexer}} in PySpark.) StringIndexerModel lacks the method labels --- Key: SPARK-10111 URL: https://issues.apache.org/jira/browse/SPARK-10111 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Kai Sasaki Missing {{labels}} property of {{StringIndexerModel}} in PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10027) Add Python API missing methods for ml.feature
[ https://issues.apache.org/jira/browse/SPARK-10027?page=com.atlassian.jira.plugin.system.issuetabpanelfocusedCommentId=14702861#comment-14702861 ] Kai Sasaki commented on SPARK-10027: [~yanbo] Can I work on this JIRA? Thank you. Add Python API missing methods for ml.feature - Key: SPARK-10027 URL: https://issues.apache.org/jira/browse/SPARK-10027 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Yanbo Liang Missing methods of ml.feature are listed here: * StringIndexer lacks the parameter handleInvalid. * StringIndexerModel lacks the method labels. * VectorIndexerModel lacks the methods numFeatures and categoryMaps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10012) Missing test case for Params#arrayLengthGt
[ https://issues.apache.org/jira/browse/SPARK-10012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Sasaki updated SPARK-10012: --- Summary: Missing test case for Params#arrayLengthGt (was: Missing test case for ParamsarrayLengthGt) Missing test case for Params#arrayLengthGt -- Key: SPARK-10012 URL: https://issues.apache.org/jira/browse/SPARK-10012 Project: Spark Issue Type: Test Components: ML, Tests Affects Versions: 1.5.0 Reporter: Kai Sasaki Priority: Trivial Currently there is no test case for {{Params#arrayLengthGt}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10012) Missing test case for ParamsarrayLengthGt
Kai Sasaki created SPARK-10012: -- Summary: Missing test case for ParamsarrayLengthGt Key: SPARK-10012 URL: https://issues.apache.org/jira/browse/SPARK-10012 Project: Spark Issue Type: Test Components: ML, Tests Affects Versions: 1.5.0 Reporter: Kai Sasaki Priority: Trivial Currently there is no test case for {{Params#arrayLengthGt}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
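A sketch of a test that could cover it, assuming {{ParamValidators.arrayLengthGt}} returns a predicate over arrays as its name suggests:
{code}
import org.apache.spark.ml.param.ParamValidators

val check = ParamValidators.arrayLengthGt[Int](2.0)
assert(check(Array(1, 2, 3)))  // length 3 > 2: passes
assert(!check(Array(1, 2)))    // length 2 is not > 2: fails
{code}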
[jira] [Commented] (SPARK-10009) PySpark Param of Vector type can be set with Python array or numpy.array
[ https://issues.apache.org/jira/browse/SPARK-10009?page=com.atlassian.jira.plugin.system.issuetabpanelfocusedCommentId=14698515#comment-14698515 ] Kai Sasaki commented on SPARK-10009: [~yanbo] Currently, ML model parameters look like they must be set as a keyword dictionary. Specifically, which kinds of parameters can be set as vector types? PySpark Param of Vector type can be set with Python array or numpy.array Key: SPARK-10009 URL: https://issues.apache.org/jira/browse/SPARK-10009 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Yanbo Liang If the type of a Param in the PySpark ML pipeline is Vector, we can set it with a Vector currently. We also need to support setting it with a Python array or numpy.array. It should be handled in the wrapper (_transfer_params_to_java). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10002) SSH problem during Setup of Spark(1.3.0) cluster on EC2
[ https://issues.apache.org/jira/browse/SPARK-10002?page=com.atlassian.jira.plugin.system.issuetabpanelfocusedCommentId=14698516#comment-14698516 ] Kai Sasaki commented on SPARK-10002: [~deepalib] Can you log in via SSH, not just ping? Ping does not confirm the state of the SSH port, so it might be a problem with the security group. SSH problem during Setup of Spark(1.3.0) cluster on EC2 --- Key: SPARK-10002 URL: https://issues.apache.org/jira/browse/SPARK-10002 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.3.0 Environment: EC2, SPARK 1.3.0 cluster setup in vpc/subnet. Reporter: Deepali Bhandari Steps to start a Spark cluster with EC2 scripts 1. I created an ec2 instance in the vpc, and subnet. Amazon Linux 2. I downloaded spark-1.3.0 3. chmod 400 key file 4. Export aws access and secret keys 5. Now ran the command ./spark-ec2 --key-pair=deepali-ec2-keypair --identity-file=/home/ec2-user/Spark/deepali-ec2-keypair.pem --region=us-west-2 --zone=us-west-2b --vpc-id=vpc-03d67b66 --subnet-id=subnet-72fd5905 --resume launch deepali-spark-nodocker 6. The master and slave instances are created but cannot ssh says host not resolved. 7. I can ping the master and slave, I can ssh from the command line, but not from the ec2 scripts. 8. I have spent more than 2 days now. But no luck yet. 9. The ec2 scripts don't work .. code has a bug in referencing the cluster nodes via the wrong hostnames SCREEN CONSOLE log ./spark-ec2 --key-pair=deepali-ec2-keypair --identity-file=/home/ec2-user/Spark/deepali-ec2-keypair.pem --region=us-west-2 --zone=us-west-2b --vpc-id=vpc-03d67b66 --subnet-id=subnet-72fd5905 launch deepali-spark-nodocker Downloading Boto from PyPi Finished downloading Boto Setting up security groups... Creating security group deepali-spark-nodocker-master Creating security group deepali-spark-nodocker-slaves Searching for existing cluster deepali-spark-nodocker... Spark AMI: ami-9a6e0daa Launching instances... Launched 1 slaves in us-west-2b, regid = r-0d2088fb Launched master in us-west-2b, regid = r-312088c7 Waiting for AWS to propagate instance metadata... Waiting for cluster to enter 'ssh-ready' state... Warning: SSH connection error. (This could be temporary.) Host: None SSH return code: 255 SSH output: ssh: Could not resolve hostname None: Name or service not known -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9977) The usage of a label generated by StringIndexer
Kai Sasaki created SPARK-9977: - Summary: The usage of a label generated by StringIndexer Key: SPARK-9977 URL: https://issues.apache.org/jira/browse/SPARK-9977 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Kai Sasaki Priority: Trivial By using {{StringIndexer}}, we obtain the indexed label in a new column, so a following estimator should use this new column through the pipeline if it wants to use the string-indexed label. I think it would be better to make this explicit in the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
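A short Scala sketch of the point (the estimator and column names are illustrative):
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
// The downstream estimator must point at the indexer's output column.
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
val pipeline = new Pipeline().setStages(Array(indexer, dt))
{code}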
[jira] [Commented] (SPARK-9841) Params.clear needs to be public
[ https://issues.apache.org/jira/browse/SPARK-9841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14695103#comment-14695103 ] Kai Sasaki commented on SPARK-9841: --- [~josephkb] Do you have any use case where a public clear method is useful? I think the parameters must already have been set through training. When we want to reset a parameter and create the model again, it should only be necessary to train the estimator again. If there are cases where a public clear method is needed, the {{set}} method should also be public, I think. Is that correct? Params.clear needs to be public --- Key: SPARK-9841 URL: https://issues.apache.org/jira/browse/SPARK-9841 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley It is currently impossible to clear Param values once set. It would be helpful to be able to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
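As a hedged illustration of the use case (the names are standard spark.ml ones, but the snippet assumes {{clear}} is public, which is exactly what this JIRA proposes): resetting an optional Param on an already-trained model so it falls back to its default, without retraining.
{code}
import org.apache.spark.ml.classification.LogisticRegression

// df is a hypothetical training DataFrame with label/features columns.
val model = new LogisticRegression().setThreshold(0.7).fit(df)

// With a public clear(), the explicit threshold could be dropped and the
// default restored without re-running fit():
model.clear(model.threshold)
{code}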
[jira] [Commented] (SPARK-9073) spark.ml Models copy() should call setParent when there is a parent
[ https://issues.apache.org/jira/browse/SPARK-9073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629527#comment-14629527 ] Kai Sasaki commented on SPARK-9073: --- [~josephkb] Hi, if possible, can I work on this JIRA? Thank you. spark.ml Models copy() should call setParent when there is a parent --- Key: SPARK-9073 URL: https://issues.apache.org/jira/browse/SPARK-9073 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.5.0 Reporter: Joseph K. Bradley Priority: Minor Examples with this mistake include: * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala#L119] * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L220] Whoever writes a PR for this JIRA should check all spark.ml Model copy() methods and set the copy's {{Model.parent}} when available. Also verify in unit tests (possibly in a standard method checking Models, to share code). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
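A minimal sketch of the pattern being requested, modeled loosely on the DecisionTreeClassifier example linked above (not the actual Spark source): {{copy()}} should propagate the parent estimator whenever one has been set.
{code}
// Inside a concrete spark.ml Model subclass (assumes the 1.5-era Model API).
// copyValues copies Param values into the fresh instance; setParent restores
// the link back to the estimator that produced the model.
override def copy(extra: ParamMap): DecisionTreeClassificationModel = {
  val copied = copyValues(new DecisionTreeClassificationModel(uid, rootNode), extra)
  if (hasParent) copied.setParent(parent) else copied
}
{code}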
[jira] [Created] (SPARK-9064) Job fail due to timeout with spark-packages
Kai Sasaki created SPARK-9064: - Summary: Job fail due to timeout with spark-packages Key: SPARK-9064 URL: https://issues.apache.org/jira/browse/SPARK-9064 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit Affects Versions: 1.5.0 Reporter: Kai Sasaki With spark-packages (any package), the Spark job fails due to a timeout. Without spark-packages, all jobs work. {code} $ ./bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 scala> import org.apache.spark.mllib.util._ import org.apache.spark.mllib.util._ scala> sc.textFile("README.md").count [Stage 0: (0 + 2) / 2]15/07/15 15:58:09 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1) java.net.SocketTimeoutException: connect timed out at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.<init>(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.http.HttpClient.New(HttpClient.java:326) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:652) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:466) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:398) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:390) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:390) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} All error logs are attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9064) Job fail due to timeout with spark-packages
[ https://issues.apache.org/jira/browse/SPARK-9064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Sasaki updated SPARK-9064: -- Description: With spark-packages (any package), the Spark job fails due to a timeout. Without spark-packages, all jobs work. {code} $ ./bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 scala> import org.apache.spark.mllib.util._ import org.apache.spark.mllib.util._ scala> sc.textFile("README.md").count [Stage 0: (0 + 2) / 2]15/07/15 15:58:09 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1) java.net.SocketTimeoutException: connect timed out at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.<init>(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.http.HttpClient.New(HttpClient.java:326) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:652) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:466) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:398) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:390) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:390) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} All error logs are attached. Environment is MacOSX 10.10.4. Spark was built from the master branch and the target was hadoop-2.6. was: With spark-packages(Any packages), spark job fails due to timeout. Without spark-packages any jobs are working.
[jira] [Updated] (SPARK-9064) Job fail due to timeout with spark-packages
[ https://issues.apache.org/jira/browse/SPARK-9064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Sasaki updated SPARK-9064: -- Attachment: error_logs.txt Job fail due to timeout with spark-packages --- Key: SPARK-9064 URL: https://issues.apache.org/jira/browse/SPARK-9064 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit Affects Versions: 1.5.0 Reporter: Kai Sasaki Labels: package Attachments: error_logs.txt With spark-packages (any package), the Spark job fails due to a timeout. Without spark-packages, all jobs work. {code} $ ./bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 scala> import org.apache.spark.mllib.util._ import org.apache.spark.mllib.util._ scala> sc.textFile("README.md").count [Stage 0: (0 + 2) / 2]15/07/15 15:58:09 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1) java.net.SocketTimeoutException: connect timed out at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.<init>(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.http.HttpClient.New(HttpClient.java:326) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:652) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:466) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:398) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:390) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:390) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} All error logs are attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8939) YARN EC2 default setting fails with IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-8939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14620534#comment-14620534 ] Kai Sasaki commented on SPARK-8939: --- Should {{--num-executors}} also default to 2 as the number of executors on YARN? {code} --num-executors NUM Number of executors to launch (Default: 2). {code} If possible, can I work on this JIRA? YARN EC2 default setting fails with IllegalArgumentException Key: SPARK-8939 URL: https://issues.apache.org/jira/browse/SPARK-8939 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.5.0 Reporter: Andrew Or I just set it up from scratch using the spark-ec2 script. Then I ran {code} bin/spark-shell --master yarn {code} which failed with {code} 15/07/09 03:44:29 ERROR SparkContext: Error initializing SparkContext. java.lang.IllegalArgumentException: Unknown/unsupported param List(--num-executors, , --executor-memory, 6154m, --executor-memory, 6154m, --executor-cores, 2, --name, Spark shell) {code} This goes away if I provide `--num-executors`, but we should fix the default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14610018#comment-14610018 ] Kai Sasaki commented on SPARK-1503: --- [~staple] [~josephkb] Thank you for the ping and the inspiring information! I'll rewrite the current patch based on your logic and code. Thanks a lot. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Attachments: linear.png, linear_l1.png, logistic.png, logistic_l2.png Nesterov's accelerated first-order method is a drop-in replacement for steepest descent, but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
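For reference, one standard textbook form of Nesterov's accelerated update for minimizing a smooth function f with step size t (TFOCS implements variants of this scheme for composite objectives; this is not necessarily the exact variant being ported):
{code}
x_k = y_{k-1} - t\,\nabla f(y_{k-1})
y_k = x_k + \frac{k-1}{k+2}\,(x_k - x_{k-1})
{code}
Plain gradient descent is recovered by dropping the momentum term, i.e. y_k = x_k.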
[jira] [Commented] (SPARK-8419) Statistics.colStats could avoid an extra count()
[ https://issues.apache.org/jira/browse/SPARK-8419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595048#comment-14595048 ] Kai Sasaki commented on SPARK-8419: --- In {{Statistics#colStats}}, the number of rows seems to be updated in {{computeColumnSummaryStatistics}} via {{updateNumRows}}. This is computed in a distributed fashion inside {{RDD#treeAggregate}}. So I think there is no extra {{count()}} when only creating a {{RowMatrix}}. Is this assumption correct? Statistics.colStats could avoid an extra count() Key: SPARK-8419 URL: https://issues.apache.org/jira/browse/SPARK-8419 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Trivial Labels: starter Statistics.colStats goes through RowMatrix to compute the stats. But RowMatrix.computeColumnSummaryStatistics does an extra count() which could be avoided. Not going through RowMatrix would skip this extra pass over the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
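A minimal spark-shell sketch of the single-pass aggregation being discussed, against a hypothetical {{RDD[Vector]}} named {{rows}}; {{MultivariateOnlineSummarizer}} accumulates the count, mean, and variance in the same pass, so no separate {{count()}} is needed:
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

val summary = rows.treeAggregate(new MultivariateOnlineSummarizer)(
  (agg, v) => agg.add(v),   // fold each row into the per-partition summarizer
  (a, b) => a.merge(b))     // merge per-partition summarizers

summary.count  // row count, obtained without an extra pass over the data
{code}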
[jira] [Commented] (SPARK-6263) Python MLlib API missing items: Utils
[ https://issues.apache.org/jira/browse/SPARK-6263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491060#comment-14491060 ] Kai Sasaki commented on SPARK-6263: --- Can I work on this JIRA? Please assign it to me. Thank you. Python MLlib API missing items: Utils - Key: SPARK-6263 URL: https://issues.apache.org/jira/browse/SPARK-6263 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task. MLUtils * appendBias * kFold * loadVectors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
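For context on SPARK-6263, a spark-shell sketch of the Scala-side MLUtils members the JIRA asks to expose in Python; {{features}} (a {{Vector}}) and {{data}} (an {{RDD[LabeledPoint]}}) are hypothetical inputs:
{code}
import org.apache.spark.mllib.util.MLUtils

// appendBias adds a constant bias (intercept) term to a feature vector.
val withBias = MLUtils.appendBias(features)

// kFold splits an RDD into (training, validation) pairs for cross-validation.
val folds = MLUtils.kFold(data, numFolds = 3, seed = 42)
{code}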
[jira] [Created] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
Kai Sasaki created SPARK-6720: - Summary: PySpark MultivariateStatisticalSummary unit test for normL1 and normL2 Key: SPARK-6720 URL: https://issues.apache.org/jira/browse/SPARK-6720 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Kai Sasaki Priority: Minor Fix For: 1.4.0 Implement correct normL1 and normL2 tests. Continuation of: https://github.com/apache/spark/pull/5359 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Sasaki updated SPARK-4036: -- Attachment: CRF_design.1.pdf Add Conditional Random Fields (CRF) algorithm to Spark MLlib Key: SPARK-4036 URL: https://issues.apache.org/jira/browse/SPARK-4036 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Kai Sasaki Attachments: CRF_design.1.pdf Conditional random fields (CRFs) are a class of statistical modelling methods often applied in pattern recognition and machine learning, where they are used for structured prediction. The paper: http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6643) Python API for StandardScalerModel
Kai Sasaki created SPARK-6643: - Summary: Python API for StandardScalerModel Key: SPARK-6643 URL: https://issues.apache.org/jira/browse/SPARK-6643 Project: Spark Issue Type: Task Components: MLlib Affects Versions: 1.3.0 Reporter: Kai Sasaki Priority: Minor Fix For: 1.4.0 This is a sub-task of SPARK-6254. Wrap the missing methods for {{StandardScalerModel}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386636#comment-14386636 ] Kai Sasaki commented on SPARK-4036: --- [~mengxr] I wrote a design doc based on your advice. Thank you. Add Conditional Random Fields (CRF) algorithm to Spark MLlib Key: SPARK-4036 URL: https://issues.apache.org/jira/browse/SPARK-4036 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Kai Sasaki Conditional random fields (CRFs) are a class of statistical modelling methods often applied in pattern recognition and machine learning, where they are used for structured prediction. The paper: http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6615) Python API for Word2Vec
Kai Sasaki created SPARK-6615: - Summary: Python API for Word2Vec Key: SPARK-6615 URL: https://issues.apache.org/jira/browse/SPARK-6615 Project: Spark Issue Type: Task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Kai Sasaki Priority: Minor Fix For: 1.4.0 This is a sub-task of [SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254]. Wrap the missing methods for {{Word2Vec}} and {{Word2VecModel}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6598) Python API for IDFModel
Kai Sasaki created SPARK-6598: - Summary: Python API for IDFModel Key: SPARK-6598 URL: https://issues.apache.org/jira/browse/SPARK-6598 Project: Spark Issue Type: Task Components: MLlib Affects Versions: 1.4.0 Reporter: Kai Sasaki Priority: Minor This is a sub-task of [SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254]. Wrap the {{IDFModel}} {{idf}} member function for PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6261) Python MLlib API missing items: Feature
[ https://issues.apache.org/jira/browse/SPARK-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386036#comment-14386036 ] Kai Sasaki commented on SPARK-6261: --- [~josephkb] I created a JIRA for IDFModel here: [SPARK-6598|https://issues.apache.org/jira/browse/SPARK-6598]. Thank you! Python MLlib API missing items: Feature --- Key: SPARK-6261 URL: https://issues.apache.org/jira/browse/SPARK-6261 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task. StandardScalerModel * All functionality except predict() is missing. IDFModel * idf Word2Vec * setMinCount Word2VecModel * getVectors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6261) Python MLlib API missing items: Feature
[ https://issues.apache.org/jira/browse/SPARK-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381039#comment-14381039 ] Kai Sasaki commented on SPARK-6261: --- [~josephkb] Can I work on this JIRA? And I have a question. {{StandardScalerModel}} seems to have no method named {{predict()}}, correct? Are we supposed to wrap other methods implemented in {{StandardScalerModel}}? Python MLlib API missing items: Feature --- Key: SPARK-6261 URL: https://issues.apache.org/jira/browse/SPARK-6261 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task. StandardScalerModel * All functionality except predict() is missing. IDFModel * idf Word2Vec * setMinCount Word2VecModel * getVectors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
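As a reference for the SPARK-6261 wrapping work, a spark-shell sketch of the Scala-side members listed in the issue; {{tf}} (an {{RDD[Vector]}} of term frequencies) and {{corpus}} (an {{RDD[Iterable[String]]}}) are hypothetical inputs:
{code}
import org.apache.spark.mllib.feature.{IDF, Word2Vec}

val idfModel = new IDF().fit(tf)   // IDFModel
idfModel.idf                       // Vector of per-feature IDF weights

val w2vModel = new Word2Vec().setMinCount(5).fit(corpus)  // Word2VecModel
w2vModel.getVectors                // Map[String, Array[Float]]
{code}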
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362295#comment-14362295 ] Kai Sasaki commented on SPARK-4036: --- [~mengxr] I'm thinking about the design of CRF, and I have a question. The gradient descent currently implemented in MLlib should be used in CRF, but the current {{Optimizer}} can receive only {{RDD\[(Double, Vector)\]}}. A general CRF should receive various types of labels and optimize over them. Is there any plan to extend {{Optimizer}} so it can optimize non-double labels (such as strings)? Or do you have any other idea for training on non-double labels with the current {{Optimizer}}? Thank you. Add Conditional Random Fields (CRF) algorithm to Spark MLlib Key: SPARK-4036 URL: https://issues.apache.org/jira/browse/SPARK-4036 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Kai Sasaki Conditional random fields (CRFs) are a class of statistical modelling methods often applied in pattern recognition and machine learning, where they are used for structured prediction. The paper: http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
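One workaround sketch (an assumption on my part, not a committed design): encode categorical labels as Double indices so the data fits the existing {{RDD[(Double, Vector)]}} contract of the optimizer.
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// data: RDD[(String, Vector)] with categorical (e.g. string) labels.
def encodeLabels(data: RDD[(String, Vector)]): RDD[(Double, Vector)] = {
  // Build a label -> index dictionary once on the driver; shipping it via
  // the closure is fine for a small label set.
  val labelToIndex = data.map(_._1).distinct().collect().zipWithIndex.toMap
  data.map { case (label, features) => (labelToIndex(label).toDouble, features) }
}
{code}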
[jira] [Commented] (SPARK-6336) LBFGS should document what convergenceTol means
[ https://issues.apache.org/jira/browse/SPARK-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362347#comment-14362347 ] Kai Sasaki commented on SPARK-6336: --- I created a patch, but a unit test has already been written for the convergence tolerance. Is [this|https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala#L138-L195] the right one? LBFGS should document what convergenceTol means --- Key: SPARK-6336 URL: https://issues.apache.org/jira/browse/SPARK-6336 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Trivial LBFGS uses Breeze's LBFGS, which uses a relative convergence tolerance. We should document that convergenceTol is relative and ensure in a unit test that this behavior does not change in Breeze without us realizing it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
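For the documentation, a relative convergence check of the kind Breeze applies compares successive objective values scaled by their magnitude, along the lines of (an illustrative form, not necessarily Breeze's exact condition):
{code}
\frac{|f(x_{k+1}) - f(x_k)|}{\max(|f(x_k)|, 1)} \le \varepsilon
{code}
whereas an absolute tolerance would drop the denominator, so the same {{convergenceTol}} value behaves very differently depending on the objective's scale.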
[jira] [Commented] (SPARK-5261) In some cases ,The value of word's vector representation is too big
[ https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292750#comment-14292750 ] Kai Sasaki commented on SPARK-5261: --- [~gq] Can you provide us the data set? I tried several numbers of partitions but could not reproduce it. In some cases, the value of word's vector representation is too big --- Key: SPARK-5261 URL: https://issues.apache.org/jira/browse/SPARK-5261 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Guoqiang Li {code} val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36) {code} The average absolute value of the word's vector representation is 60731.8 {code} val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(1) {code} The average absolute value of the word's vector representation is 0.13889 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5119) java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model
[ https://issues.apache.org/jira/browse/SPARK-5119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270316#comment-14270316 ] Kai Sasaki commented on SPARK-5119: --- I think the impurity implementations in MLlib cannot handle negative labels; in this case it is -1. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala#L93 Should impurity support negative labels? java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model --- Key: SPARK-5119 URL: https://issues.apache.org/jira/browse/SPARK-5119 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 1.1.0, 1.2.0 Environment: Linux ubuntu 14.04 Reporter: Vivek Kulkarni First I tried to see if there was a bug raised before with a similar trace. I found https://www.mail-archive.com/user@spark.apache.org/msg13708.html but the suggestion to upgrade to the latest code base (I cloned from the master branch) does not fix this issue. Issue: try to train a decision tree classifier on some data. After training, when it begins to collect, it crashes: 15/01/06 22:28:15 INFO BlockManagerMaster: Updated info of block rdd_52_1 15/01/06 22:28:15 ERROR Executor: Exception in task 1.0 in stage 31.0 (TID 1895) java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.mllib.tree.impurity.GiniAggregator.update(Gini.scala:93) at org.apache.spark.mllib.tree.impl.DTStatsAggregator.update(DTStatsAggregator.scala:100) at org.apache.spark.mllib.tree.DecisionTree$.orderedBinSeqOp(DecisionTree.scala:419) at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$nodeBinSeqOp$1(DecisionTree.scala:511) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:536) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:533) at scala.collection.immutable.Map$Map1.foreach(Map.scala:109) at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:533) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628) at org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628) at scala.collection.Iterator$class.foreach(Iterator.scala:727) Minimal code: data = MLUtils.loadLibSVMFile(sc, '/scratch1/vivek/datasets/private/a1a').cache() model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={}, maxDepth=5, maxBins=100) Just download the data from: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
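A hedged workaround sketch in Scala (the report's minimal code is PySpark, but the same remapping applies): MLlib's DecisionTree expects class labels in {0, ..., numClasses-1}, so the -1/+1 labels in the a1a libsvm file must be remapped before training.
{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

val raw = MLUtils.loadLibSVMFile(sc, "a1a")  // labels in this file are -1.0 / +1.0

// Remap -1 -> 0 so labels lie in {0, 1}, as DecisionTree requires.
val data = raw.map(lp => LabeledPoint(if (lp.label < 0) 0.0 else lp.label, lp.features))

val model = DecisionTree.trainClassifier(data, numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](), impurity = "gini",
  maxDepth = 5, maxBins = 100)
{code}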
[jira] [Commented] (SPARK-4284) BinaryClassificationMetrics precision-recall method names should correspond to return types
[ https://issues.apache.org/jira/browse/SPARK-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267450#comment-14267450 ] Kai Sasaki commented on SPARK-4284: --- I'd like to work on this issue if it is not fixed yet. Could you assign it to me? BinaryClassificationMetrics precision-recall method names should correspond to return types --- Key: SPARK-4284 URL: https://issues.apache.org/jira/browse/SPARK-4284 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor BinaryClassificationMetrics has several methods which work with (recall, precision) pairs, but the method names all use the wrong order (pr). This order should be fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4284) BinaryClassificationMetrics precision-recall method names should correspond to return types
[ https://issues.apache.org/jira/browse/SPARK-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267535#comment-14267535 ] Kai Sasaki commented on SPARK-4284: --- [~srowen] It's very helpful advice. Thank you! BinaryClassificationMetrics precision-recall method names should correspond to return types --- Key: SPARK-4284 URL: https://issues.apache.org/jira/browse/SPARK-4284 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor BinaryClassificationMetrics has several methods which work with (recall, precision) pairs, but the method names all use the wrong order (pr). This order should be fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266986#comment-14266986 ] Kai Sasaki commented on SPARK-5019: --- I'm sorry for submitting a premature PR. Is it OK to ask someone to assign the tickets I want to take, from next time? I seem to have no rights to assign issues to myself. I want to check SPARK-5018 and review it. Sorry for disturbing you [~tgaloppo] Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Blocker The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5073) spark.storage.memoryMapThreshold have two default value
[ https://issues.apache.org/jira/browse/SPARK-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264558#comment-14264558 ] Kai Sasaki commented on SPARK-5073: --- I did not notice the above comment. Sorry, I've just created a PR for this issue. spark.storage.memoryMapThreshold have two default value - Key: SPARK-5073 URL: https://issues.apache.org/jira/browse/SPARK-5073 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Yuan Jianhui Priority: Minor In org.apache.spark.storage.DiskStore: val minMemoryMapBytes = blockManager.conf.getLong("spark.storage.memoryMapThreshold", 2 * 4096L) In org.apache.spark.network.util.TransportConf: public int memoryMapBytes() { return conf.getInt("spark.storage.memoryMapThreshold", 2 * 1024 * 1024); } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4607) Add random seed to GradientBoostedTrees
[ https://issues.apache.org/jira/browse/SPARK-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237029#comment-14237029 ] Kai Sasaki commented on SPARK-4607: --- [~josephkb] I think each tree in the iterations of GradientBoostedTrees is always trained on all the training data. Is there any case where we have to do subsampling when building the RandomForest? The current GradientBoostedTrees code uses RandomForest without subsampling. Add random seed to GradientBoostedTrees --- Key: SPARK-4607 URL: https://issues.apache.org/jira/browse/SPARK-4607 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor Gradient Boosted Trees does not take a random seed, but it uses randomness if the subsampling rate is not 1. It should take a random seed parameter. This update will also help to make unit tests more stable by allowing determinism (using a small set of fixed random seeds). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4607) Add random seed to GradientBoostedTrees
[ https://issues.apache.org/jira/browse/SPARK-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237029#comment-14237029 ] Kai Sasaki edited comment on SPARK-4607 at 12/7/14 2:34 AM: [~josephkb] I think each tree in the iterations of GradientBoostedTrees is always trained on all the training data. Is there any case where we have to do subsampling when building the RandomForest? The current GradientBoostedTrees code uses RandomForest without subsampling. was (Author: lewuathe): [~josephkb] I think each trees in iterations of GrandientBoostedTrees is always trained all training data. Is there any case when we have to do subsampling with making RandomForest? Current GrandientBoostedTrees code uses non subsampling RandomForest. Add random seed to GradientBoostedTrees --- Key: SPARK-4607 URL: https://issues.apache.org/jira/browse/SPARK-4607 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor Gradient Boosted Trees does not take a random seed, but it uses randomness if the subsampling rate is not 1. It should take a random seed parameter. This update will also help to make unit tests more stable by allowing determinism (using a small set of fixed random seeds). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4652) Add docs about spark-git-repo option
Kai Sasaki created SPARK-4652: - Summary: Add docs about spark-git-repo option Key: SPARK-4652 URL: https://issues.apache.org/jira/browse/SPARK-4652 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.1.0 Reporter: Kai Sasaki Priority: Minor It was a little hard to understand how to use the --spark-git-repo option of the spark-ec2 script. Some additional documentation might be needed to use it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4656) Typo in Programming Guide markdown
Kai Sasaki created SPARK-4656: - Summary: Typo in Programming Guide markdown Key: SPARK-4656 URL: https://issues.apache.org/jira/browse/SPARK-4656 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Kai Sasaki Priority: Trivial Grammatical error in Programming Guide document -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4656) Typo in Programming Guide markdown
[ https://issues.apache.org/jira/browse/SPARK-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228998#comment-14228998 ] Kai Sasaki commented on SPARK-4656: --- Created the patch. Please review it. https://github.com/apache/spark/pull/3412 Typo in Programming Guide markdown -- Key: SPARK-4656 URL: https://issues.apache.org/jira/browse/SPARK-4656 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Kai Sasaki Priority: Trivial Grammatical error in Programming Guide document -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224467#comment-14224467 ] Kai Sasaki commented on SPARK-4036: --- Hi, I want to work on this ticket. I'll write a base implementation of CRF running on MLlib. Could you please assign this ticket to me? Add Conditional Random Fields (CRF) algorithm to Spark MLlib Key: SPARK-4036 URL: https://issues.apache.org/jira/browse/SPARK-4036 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Conditional random fields (CRFs) are a class of statistical modelling methods often applied in pattern recognition and machine learning, where they are used for structured prediction. The paper: http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4490) Not found RandomGenerator through spark-shell
[ https://issues.apache.org/jira/browse/SPARK-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222382#comment-14222382 ] Kai Sasaki commented on SPARK-4490: --- [~srowen] Yes, exactly. So one option is to remove the test-scope dependency written in pom.xml in order to always add commons-math3 and breeze to the classpath whenever the Spark source is built without tests. Previously I checked the current 1.1.0 release source code distributed on the Apache site (https://spark.apache.org/downloads.html). But on the current master HEAD, this problem has already been solved (the test-scope dependency was removed), so I think this problem is fixed. Thank you. Not found RandomGenerator through spark-shell - Key: SPARK-4490 URL: https://issues.apache.org/jira/browse/SPARK-4490 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: spark-shell Reporter: Kai Sasaki In spark-1.1.0, an exception is thrown whenever RandomGenerator of commons-math3 is used. There is a workaround for this problem. http://find.searchhub.org/document/6df0c89201dfe386#10bd443c4da849a3 ``` scala> import breeze.linalg._ import breeze.linalg._ scala> Matrix.rand[Double](3, 3) java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator at breeze.linalg.MatrixConstructors$class.rand$default$3(Matrix.scala:205) at breeze.linalg.Matrix$.rand$default$3(Matrix.scala:139) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:14) at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19) at $iwC$$iwC$$iwC$$iwC.<init>(<console>:21) at $iwC$$iwC$$iwC.<init>(<console>:23) at $iwC$$iwC.<init>(<console>:25) at $iwC.<init>(<console>:27) at <init>(<console>:29) at .<init>(<console>:33) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: org.apache.commons.math3.random.RandomGenerator at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 44 more ```
[jira] [Commented] (SPARK-4288) Add Sparse Autoencoder algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221895#comment-14221895 ] Kai Sasaki commented on SPARK-4288: --- [~mengxr] Thank you. I'll join. Add Sparse Autoencoder algorithm to MLlib -- Key: SPARK-4288 URL: https://issues.apache.org/jira/browse/SPARK-4288 Project: Spark Issue Type: Wish Components: MLlib Reporter: Guoqiang Li Labels: features Are you proposing an implementation? Is it related to the neural network JIRA? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4490) Not found RandomGenerator through spark-shell
[ https://issues.apache.org/jira/browse/SPARK-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221779#comment-14221779 ] Kai Sasaki commented on SPARK-4490: --- Sean, thank you for replying. Anyway, I checked the HEAD of the current master branch and this scope dependency has already been removed, so it won't occur in future versions. https://github.com/apache/spark/blob/master/pom.xml#L378-L382 https://github.com/apache/spark/blob/master/core/pom.xml#L123-L126 Not found RandomGenerator through spark-shell - Key: SPARK-4490 URL: https://issues.apache.org/jira/browse/SPARK-4490 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: spark-shell Reporter: Kai Sasaki In spark-1.1.0, an exception is thrown whenever RandomGenerator of commons-math3 is used. There is a workaround for this problem. http://find.searchhub.org/document/6df0c89201dfe386#10bd443c4da849a3 ``` scala> import breeze.linalg._ import breeze.linalg._ scala> Matrix.rand[Double](3, 3) java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator at breeze.linalg.MatrixConstructors$class.rand$default$3(Matrix.scala:205) at breeze.linalg.Matrix$.rand$default$3(Matrix.scala:139) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:14) at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19) at $iwC$$iwC$$iwC$$iwC.<init>(<console>:21) at $iwC$$iwC$$iwC.<init>(<console>:23) at $iwC$$iwC.<init>(<console>:25) at $iwC.<init>(<console>:27) at <init>(<console>:29) at .<init>(<console>:33) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: org.apache.commons.math3.random.RandomGenerator at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 44 more ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4490) Not found RandomGenerator through spark-shell
[ https://issues.apache.org/jira/browse/SPARK-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219277#comment-14219277 ] Kai Sasaki commented on SPARK-4490: --- I found the reason why this error occurs. The dependency scope of commons-math3 is defined as test in pom.xml:
```
<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-math3</artifactId>
  <version>3.3</version>
  <scope>test</scope>
</dependency>
```
I built Spark with -DskipTests, so commons-math3 was not included in my classpath. But is it necessary to define this scope as test? I think it is more useful to remove this scope definition, as with most other dependencies. There might be cases where Spark is built without tests. Not found RandomGenerator through spark-shell - Key: SPARK-4490 URL: https://issues.apache.org/jira/browse/SPARK-4490 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: spark-shell Reporter: Kai Sasaki In spark-1.1.0, an exception is thrown whenever RandomGenerator of commons-math3 is used. There is a workaround for this problem. http://find.searchhub.org/document/6df0c89201dfe386#10bd443c4da849a3 ``` scala> import breeze.linalg._ import breeze.linalg._ scala> Matrix.rand[Double](3, 3) java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator at breeze.linalg.MatrixConstructors$class.rand$default$3(Matrix.scala:205) at breeze.linalg.Matrix$.rand$default$3(Matrix.scala:139) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:14) at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19) at $iwC$$iwC$$iwC$$iwC.<init>(<console>:21) at $iwC$$iwC$$iwC.<init>(<console>:23) at $iwC$$iwC.<init>(<console>:25) at $iwC.<init>(<console>:27) at <init>(<console>:29) at .<init>(<console>:33) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: org.apache.commons.math3.random.RandomGenerator at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 44 more ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4490) Not found RandomGenerator through spark-shell
Kai Sasaki created SPARK-4490:
-
Summary: Not found RandomGenerator through spark-shell
Key: SPARK-4490
URL: https://issues.apache.org/jira/browse/SPARK-4490
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.1.0
Environment: spark-shell
Reporter: Kai Sasaki
Priority: Critical

In spark-1.1.0, an exception is thrown whenever RandomGenerator of commons-math3 is used. There is a workaround for this problem: http://find.searchhub.org/document/6df0c89201dfe386#10bd443c4da849a3

```
scala> import breeze.linalg._
import breeze.linalg._

scala> Matrix.rand[Double](3, 3)
java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator
  at breeze.linalg.MatrixConstructors$class.rand$default$3(Matrix.scala:205)
  at breeze.linalg.Matrix$.rand$default$3(Matrix.scala:139)
  ...
Caused by: java.lang.ClassNotFoundException: org.apache.commons.math3.random.RandomGenerator
  at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
  ...
  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
  ... 44 more
```

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
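Until the scope is changed in the build, one way to work around the missing class is to put a commons-math3 jar on the classpath yourself when starting the shell, e.g. via the {{--jars}} option. A minimal sketch; the jar path below is hypothetical and should point to wherever the jar lives on your machine:

```
# Supply commons-math3 explicitly so RandomGenerator resolves in the REPL.
./bin/spark-shell --jars /path/to/commons-math3-3.3.jar
```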
[jira] [Updated] (SPARK-4490) Not found RandomGenerator through spark-shell
[ https://issues.apache.org/jira/browse/SPARK-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Sasaki updated SPARK-4490:
--
Priority: Major (was: Critical)

> Not found RandomGenerator through spark-shell
> ---------------------------------------------
>
> Key: SPARK-4490
> URL: https://issues.apache.org/jira/browse/SPARK-4490
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.1.0
> Environment: spark-shell
> Reporter: Kai Sasaki
>
> In spark-1.1.0, an exception is thrown whenever RandomGenerator of commons-math3 is used. There is a workaround for this problem: http://find.searchhub.org/document/6df0c89201dfe386#10bd443c4da849a3
>
> ```
> scala> import breeze.linalg._
> import breeze.linalg._
>
> scala> Matrix.rand[Double](3, 3)
> java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator
>   at breeze.linalg.MatrixConstructors$class.rand$default$3(Matrix.scala:205)
>   at breeze.linalg.Matrix$.rand$default$3(Matrix.scala:139)
>   ...
> Caused by: java.lang.ClassNotFoundException: org.apache.commons.math3.random.RandomGenerator
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   ...
>   ... 44 more
> ```

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4490) Not found RandomGenerator through spark-shell
[ https://issues.apache.org/jira/browse/SPARK-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Sasaki updated SPARK-4490:
--
Description:
In spark-1.1.0, an exception is thrown whenever RandomGenerator of commons-math3 is used. There is a workaround for this problem: http://find.searchhub.org/document/6df0c89201dfe386#10bd443c4da849a3

```
scala> import breeze.linalg._
import breeze.linalg._

scala> Matrix.rand[Double](3, 3)
java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator
  at breeze.linalg.MatrixConstructors$class.rand$default$3(Matrix.scala:205)
  at breeze.linalg.Matrix$.rand$default$3(Matrix.scala:139)
  ...
Caused by: java.lang.ClassNotFoundException: org.apache.commons.math3.random.RandomGenerator
  ...
  ... 44 more
```

was:
In spark-1.1.0, exception is threw whenever RandomGenerator of commons-math3 is used. There is some workaround about this problem. http://find.searchhub.org/document/6df0c89201dfe386#10bd443c4da849a3

```
cala> import breeze.linalg._
import breeze.linalg._

scala> Matrix.rand[Double](3, 3)
java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator
  at breeze.linalg.MatrixConstructors$class.rand$default$3(Matrix.scala:205)
  at breeze.linalg.Matrix$.rand$default$3(Matrix.scala:139)
  ...
```

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4288) Add Sparse Autoencoder algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14204538#comment-14204538 ] Kai Sasaki commented on SPARK-4288:
---
Can I take charge of this ticket? I have a starter implementation of an autoencoder and a deep neural network written in Scala. I'll port it to run on the Spark platform. https://github.com/Lewuathe/42 Thank you.

> Add Sparse Autoencoder algorithm to MLlib
> ------------------------------------------
>
> Key: SPARK-4288
> URL: https://issues.apache.org/jira/browse/SPARK-4288
> Project: Spark
> Issue Type: Wish
> Components: MLlib
> Reporter: Guoqiang Li
> Labels: features
>
> Are you proposing an implementation? Is it related to the neural network JIRA?

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
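For context on what "sparse" adds over a plain autoencoder: the training objective gains a sparsity penalty, the KL divergence between a target activation level rho and each hidden unit's mean activation, scaled by a weight beta. A minimal, self-contained Scala sketch of that term follows; it is illustrative only, not MLlib API, and all names here are hypothetical:

```scala
// Sketch of the sparse-autoencoder sparsity penalty:
// beta * sum_j KL(rho || rhoHat_j), where rhoHat_j is the mean activation
// of hidden unit j over a training batch.
object SparsityPenalty {
  // KL divergence between two Bernoulli distributions with means rho and rhoHat.
  def kl(rho: Double, rhoHat: Double): Double =
    rho * math.log(rho / rhoHat) +
      (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rhoHat))

  // Total penalty over all hidden units, scaled by the weight beta.
  def penalty(rho: Double, beta: Double, meanActivations: Array[Double]): Double =
    beta * meanActivations.map(kl(rho, _)).sum

  def main(args: Array[String]): Unit = {
    // Example: target sparsity 0.05 and three hidden units' mean activations.
    val meanActs = Array(0.04, 0.06, 0.10)
    println(f"penalty = ${penalty(0.05, 3.0, meanActs)}%.4f")
  }
}
```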