[jira] [Created] (SPARK-12302) Example for servlet filter used by spark.ui.filters

2015-12-11 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-12302:
--

 Summary: Example for servlet filter used by spark.ui.filters
 Key: SPARK-12302
 URL: https://issues.apache.org/jira/browse/SPARK-12302
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Affects Versions: 1.5.2
Reporter: Kai Sasaki
Priority: Trivial


Although the {{spark.ui.filters}} configuration accepts a plain servlet filter, it is 
often difficult to understand how to write the filter code and how to integrate it 
with an actual Spark application.

It would be helpful to provide examples for trying out a secured Spark cluster.
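For illustration, a minimal sketch of such a filter could look like the following (class name, header name and parameter name are made up for the example; the exact mechanism for passing filter parameters should be taken from the Spark configuration docs):
{code}
import javax.servlet._
import javax.servlet.http.{HttpServletRequest, HttpServletResponse}

// Minimal illustrative filter: it only lets requests through when an expected
// user name is sent in a custom header. Everything here is hypothetical.
class SimpleUIAuthFilter extends Filter {

  private var expectedUser: String = _

  override def init(config: FilterConfig): Unit = {
    // Read an init parameter; fall back to "admin" when it is not configured.
    expectedUser = Option(config.getInitParameter("expected.user")).getOrElse("admin")
  }

  override def doFilter(request: ServletRequest, response: ServletResponse,
      chain: FilterChain): Unit = {
    val req = request.asInstanceOf[HttpServletRequest]
    val res = response.asInstanceOf[HttpServletResponse]
    if (expectedUser == req.getHeader("X-Spark-UI-User")) {
      chain.doFilter(request, response) // accepted: continue the filter chain
    } else {
      res.sendError(HttpServletResponse.SC_FORBIDDEN, "user is not allowed to access the UI")
    }
  }

  override def destroy(): Unit = {}
}
{code}
The filter class would then be registered on the driver, e.g. {{--conf spark.ui.filters=com.example.SimpleUIAuthFilter}}, assuming the class is on the application classpath.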






[jira] [Commented] (SPARK-11938) Expose numFeatures in all ML PredictionModel for PySpark

2015-12-07 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044673#comment-15044673
 ] 

Kai Sasaki commented on SPARK-11938:


[~mengxr] [~yanboliang] Sorry for bothering you, but could you review this if 
possible? Thank you.

> Expose numFeatures in all ML PredictionModel for PySpark
> 
>
> Key: SPARK-11938
> URL: https://issues.apache.org/jira/browse/SPARK-11938
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> SPARK-9715 provide support for numFeatures in all ML PredictionModel, we 
> should expose it at Python side. 






[jira] [Commented] (SPARK-11520) RegressionMetrics should support instance weights

2015-12-01 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035194#comment-15035194
 ] 

Kai Sasaki commented on SPARK-11520:


[~mengxr] [~josephkb] Could you review this if possible?

> RegressionMetrics should support instance weights
> -
>
> Key: SPARK-11520
> URL: https://issues.apache.org/jira/browse/SPARK-11520
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This will be important to improve LinearRegressionSummary, which currently 
> has a mix of weighted and unweighted metrics.






[jira] [Commented] (SPARK-11520) RegressionMetrics should support instance weights

2015-11-18 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012964#comment-15012964
 ] 

Kai Sasaki commented on SPARK-11520:


[~josephkb] The metrics in {{RegressionMetrics}} seem to be based on 
{{MultivariateStatisticalSummary}}, and the current {{RegressionMetrics}} does not 
accept weighted samples as an argument. So we could pass the weighted samples to 
{{MultivariateStatisticalSummary}} ({{MultivariateOnlineSummarizer}}) and calculate 
the regression metrics from them.
Is this assumption correct? Can I work on this JIRA, if possible?
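To make the idea concrete, here is a rough sketch of a weighted metric computed directly from (prediction, label, weight) triples, independent of whichever summarizer ends up being used (names are illustrative only):
{code}
import org.apache.spark.rdd.RDD

// Illustrative only: weighted mean squared error over (prediction, label, weight)
// triples, normalised by the total weight.
def weightedMSE(predictionLabelWeight: RDD[(Double, Double, Double)]): Double = {
  val (weightedSquaredError, totalWeight) = predictionLabelWeight
    .map { case (prediction, label, weight) =>
      val error = prediction - label
      (weight * error * error, weight)
    }
    .reduce { case ((e1, w1), (e2, w2)) => (e1 + e2, w1 + w2) }
  weightedSquaredError / totalWeight
}
{code}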

> RegressionMetrics should support instance weights
> -
>
> Key: SPARK-11520
> URL: https://issues.apache.org/jira/browse/SPARK-11520
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This will be important to improve LinearRegressionSummary, which currently 
> has a mix of weighted and unweighted metrics.






[jira] [Comment Edited] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-18 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15013068#comment-15013068
 ] 

Kai Sasaki edited comment on SPARK-4036 at 11/19/15 7:32 AM:
-

[~hujiayin] 
I'm sorry for the late response. I haven't created any patch yet, so please feel 
free to work on this JIRA instead of me.
In any case, may I review and comment on your patch?


was (Author: lewuathe):
[~hujiayin] 
I'm sorry for being late for response. I haven't yet create any patch. So never 
mind to work in this JIRA instead of me.
Anyway, can I give a check and comment to your patch?

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf






[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-18 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15013068#comment-15013068
 ] 

Kai Sasaki commented on SPARK-4036:
---

[~hujiayin] 
I'm sorry for the late response. I haven't created any patch yet, so please feel 
free to work on this JIRA instead of me.
In any case, may I review and comment on your patch?

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf






[jira] [Comment Edited] (SPARK-11439) Optimization of creating sparse feature without dense one

2015-11-12 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003384#comment-15003384
 ] 

Kai Sasaki edited comment on SPARK-11439 at 11/13/15 1:48 AM:
--

[~nakul02]
It seems to refer to the SparkR model here, not glmnet. According to this 
documentation, you can create a SparkR linear model with the {{glm}} function.
https://spark.apache.org/docs/latest/sparkr.html#machine-learning

This will call {{SparkRWrapper#fitRModelFormula}}, which returns a 
LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the 
second parameter. So, in summary, we can write code like the following to use 
{{LinearRegressionModel}} from SparkR.
{code}
df <- createDataFrame(sqlContext, iris)
fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
summary(fit)
{code}

In my environment, it seems to work.


was (Author: lewuathe):
[~nakul02]
It seems to indicate the model in SparkR here not gmlnet. According to this 
documentation, you can create SparkR linear model with {{glm}} function.
https://spark.apache.org/docs/latest/sparkr.html#machine-learning

This will call {{SparkRWrapper#fitRModelFormula}. It returns 
LinearRegressionModel with Pipeline when it receives "gaussian" as second 
parameter.
So in summary we can write the code like this to use {{LinearRegressionModel}} 
in SparkR.
{code}
df <- createDataFrame(sqlContext, iris)
fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
summary(fit)
{code}

In my environment, it seems to work.

> Optimization of creating sparse feature without dense one
> -
>
> Key: SPARK-11439
> URL: https://issues.apache.org/jira/browse/SPARK-11439
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>Priority: Minor
>
> Currently, sparse feature generated in {{LinearDataGenerator}} needs to 
> create dense vectors once. It is cost efficient to prevent from generating 
> dense feature when creating sparse features.






[jira] [Comment Edited] (SPARK-11439) Optimization of creating sparse feature without dense one

2015-11-12 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003384#comment-15003384
 ] 

Kai Sasaki edited comment on SPARK-11439 at 11/13/15 1:48 AM:
--

[~nakul02]
It seems to refer to the SparkR model here, not glmnet. According to this 
documentation, you can create a SparkR linear model with the {{glm}} function.
https://spark.apache.org/docs/latest/sparkr.html#machine-learning

This will call {{SparkRWrapper#fitRModelFormula}}, which returns a 
LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the 
second parameter. So, in summary, we can write code like the following to use 
{{LinearRegressionModel}} from SparkR.
{code}
df <- createDataFrame(sqlContext, iris)
fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
summary(fit)
{code}

In my environment, it seems to work.


was (Author: lewuathe):
[~nakul02]
It seems to indicate the model in SparkR here not gmlnet. According to this 
documentation, you can create SparkR linear model with `glm`.
https://spark.apache.org/docs/latest/sparkr.html#machine-learning

This will call {{SparkRWrapper#fitRModelFormula}. It returns 
LinearRegressionModel with Pipeline when it receives "gaussian" as second 
parameter.
So in summary we can write the code like this to use {{LinearRegressionModel}} 
in SparkR.
{code}
df <- createDataFrame(sqlContext, iris)
fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
summary(fit)
{code}

In my environment, it seems to work.

> Optimization of creating sparse feature without dense one
> -
>
> Key: SPARK-11439
> URL: https://issues.apache.org/jira/browse/SPARK-11439
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>Priority: Minor
>
> Currently, sparse feature generated in {{LinearDataGenerator}} needs to 
> create dense vectors once. It is cost efficient to prevent from generating 
> dense feature when creating sparse features.






[jira] [Commented] (SPARK-11439) Optimization of creating sparse feature without dense one

2015-11-12 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003384#comment-15003384
 ] 

Kai Sasaki commented on SPARK-11439:


[~nakul02]
It seems to refer to the SparkR model here. According to this documentation, 
you can create a SparkR linear model with the {{glm}} function.
https://spark.apache.org/docs/latest/sparkr.html#machine-learning

This will call {{SparkRWrapper#fitRModelFormula}}, which returns a 
LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the 
second parameter. So, in summary, we can write code like the following to use 
{{LinearRegressionModel}} from SparkR.
{code}
df <- createDataFrame(sqlContext, iris)
fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
summary(fit)
{code}

In my environment, it seems to work.

> Optimization of creating sparse feature without dense one
> -
>
> Key: SPARK-11439
> URL: https://issues.apache.org/jira/browse/SPARK-11439
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>Priority: Minor
>
> Currently, sparse feature generated in {{LinearDataGenerator}} needs to 
> create dense vectors once. It is cost efficient to prevent from generating 
> dense feature when creating sparse features.






[jira] [Comment Edited] (SPARK-11439) Optimization of creating sparse feature without dense one

2015-11-12 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003384#comment-15003384
 ] 

Kai Sasaki edited comment on SPARK-11439 at 11/13/15 1:47 AM:
--

[~nakul02]
It seems to refer to the SparkR model here, not glmnet. According to this 
documentation, you can create a SparkR linear model with the {{glm}} function.
https://spark.apache.org/docs/latest/sparkr.html#machine-learning

This will call {{SparkRWrapper#fitRModelFormula}}, which returns a 
LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the 
second parameter. So, in summary, we can write code like the following to use 
{{LinearRegressionModel}} from SparkR.
{code}
df <- createDataFrame(sqlContext, iris)
fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
summary(fit)
{code}

In my environment, it seems to work.


was (Author: lewuathe):
[~nakul02]
It seems to indicate the model in SparkR here. According to this documentation, 
you can create SparkR linear model with `glm`.
https://spark.apache.org/docs/latest/sparkr.html#machine-learning

This will call {{SparkRWrapper#fitRModelFormula}. It returns 
LinearRegressionModel with Pipeline when it receives "gaussian" as second 
parameter.
So in summary we can write the code like this to use {{LinearRegressionModel}} 
in SparkR.
{code}
df <- createDataFrame(sqlContext, iris)
fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
summary(fit)
{code}

In my environment, it seems to work.

> Optimization of creating sparse feature without dense one
> -
>
> Key: SPARK-11439
> URL: https://issues.apache.org/jira/browse/SPARK-11439
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>Priority: Minor
>
> Currently, sparse feature generated in {{LinearDataGenerator}} needs to 
> create dense vectors once. It is cost efficient to prevent from generating 
> dense feature when creating sparse features.






[jira] [Comment Edited] (SPARK-11439) Optimization of creating sparse feature without dense one

2015-11-12 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003384#comment-15003384
 ] 

Kai Sasaki edited comment on SPARK-11439 at 11/13/15 1:49 AM:
--

[~nakul02]
It seems to refer to the SparkR model here, not glmnet. According to this 
documentation, you can create a SparkR linear model with the {{glm}} function.
https://spark.apache.org/docs/latest/sparkr.html#machine-learning

This will call {{SparkRWrapper#fitRModelFormula}}, which returns a 
LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the 
second argument. So, in summary, we can write code like the following to use 
{{LinearRegressionModel}} from SparkR.
{code}
df <- createDataFrame(sqlContext, iris)
fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
summary(fit)
{code}

In my environment, it seems to work.


was (Author: lewuathe):
[~nakul02]
It seems to indicate the model in SparkR here not gmlnet. According to this 
documentation, you can create SparkR linear model with {{glm}} function.
https://spark.apache.org/docs/latest/sparkr.html#machine-learning

This will call {{SparkRWrapper#fitRModelFormula}}. It returns 
LinearRegressionModel with Pipeline when it receives "gaussian" as second 
parameter.
So in summary we can write the code like this to use {{LinearRegressionModel}} 
in SparkR.
{code}
df <- createDataFrame(sqlContext, iris)
fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
summary(fit)
{code}

In my environment, it seems to work.

> Optimization of creating sparse feature without dense one
> -
>
> Key: SPARK-11439
> URL: https://issues.apache.org/jira/browse/SPARK-11439
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>Priority: Minor
>
> Currently, sparse feature generated in {{LinearDataGenerator}} needs to 
> create dense vectors once. It is cost efficient to prevent from generating 
> dense feature when creating sparse features.






[jira] [Comment Edited] (SPARK-11439) Optimization of creating sparse feature without dense one

2015-11-12 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003384#comment-15003384
 ] 

Kai Sasaki edited comment on SPARK-11439 at 11/13/15 1:50 AM:
--

[~nakul02]
It seems to refer to the SparkR model here, not glmnet. According to this 
documentation, you can create a SparkR linear model with the {{glm}} function.
https://spark.apache.org/docs/latest/sparkr.html#machine-learning

This will call {{SparkRWrapper#fitRModelFormula}}, which returns a 
LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the 
second argument. So, in summary, we can write code like the following to use 
{{LinearRegressionModel}} from SparkR.
{code}
df <- createDataFrame(sqlContext, iris)  # You should replace this with the generated data
fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
summary(fit)
$devianceResiduals
 Min   Max
 -1.307112 1.412532

$coefficients
   Estimate  Std. Error t value  Pr(>|t|)
(Intercept)2.251393  0.3697543  6.08889  9.568102e-09
Sepal_Width0.8035609 0.106339   7.556598 4.187317e-12
Species_versicolor 1.458743  0.1121079  13.01195 0
Species_virginica  1.946817  0.100015   19.46525 0
{code}

In my environment, it seems to work.


was (Author: lewuathe):
[~nakul02]
It seems to indicate the model in SparkR here not gmlnet. According to this 
documentation, you can create SparkR linear model with {{glm}} function.
https://spark.apache.org/docs/latest/sparkr.html#machine-learning

This will call {{SparkRWrapper#fitRModelFormula}}. It returns 
LinearRegressionModel with Pipeline when it receives "gaussian" as second 
argument. So in summary we can write the code like this to use 
{{LinearRegressionModel}} in SparkR.
{code}
df <- createDataFrame(sqlContext, iris)
fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
summary(fit)
{code}

In my environment, it seems to work.

> Optimization of creating sparse feature without dense one
> -
>
> Key: SPARK-11439
> URL: https://issues.apache.org/jira/browse/SPARK-11439
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>Priority: Minor
>
> Currently, sparse feature generated in {{LinearDataGenerator}} needs to 
> create dense vectors once. It is cost efficient to prevent from generating 
> dense feature when creating sparse features.






[jira] [Comment Edited] (SPARK-11439) Optimization of creating sparse feature without dense one

2015-11-12 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003384#comment-15003384
 ] 

Kai Sasaki edited comment on SPARK-11439 at 11/13/15 1:51 AM:
--

[~nakul02]
It seems to refer to the SparkR model here, not glmnet. According to this 
documentation, you can create a SparkR linear model with the {{glm}} function.
https://spark.apache.org/docs/latest/sparkr.html#machine-learning

This will call {{SparkRWrapper#fitRModelFormula}}, which returns a 
LinearRegressionModel wrapped in a Pipeline when it receives "gaussian" as the 
second argument. So, in summary, we can write code like the following to use 
{{LinearRegressionModel}} from SparkR.
{code}
df <- createDataFrame(sqlContext, iris)  # You should replace this with the generated data
fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
summary(fit)
$devianceResiduals
 Min   Max
 -1.307112 1.412532

$coefficients
   Estimate  Std. Error t value  Pr(>|t|)
(Intercept)2.251393  0.3697543  6.08889  9.568102e-09
Sepal_Width0.8035609 0.106339   7.556598 4.187317e-12
Species_versicolor 1.458743  0.1121079  13.01195 0
Species_virginica  1.946817  0.100015   19.46525 0
{code}

In my environment (running on bin/sparkR), it seems to work.


was (Author: lewuathe):
[~nakul02]
It seems to indicate the model in SparkR here not gmlnet. According to this 
documentation, you can create SparkR linear model with {{glm}} function.
https://spark.apache.org/docs/latest/sparkr.html#machine-learning

This will call {{SparkRWrapper#fitRModelFormula}}. It returns 
LinearRegressionModel with Pipeline when it receives "gaussian" as second 
argument. So in summary we can write the code like this to use 
{{LinearRegressionModel}} in SparkR.
{code}
df <- createDataFrame(sqlContext, iris) // You should replace with generated 
data
fit <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
summary(fit)
$devianceResiduals
 Min   Max
 -1.307112 1.412532

$coefficients
   Estimate  Std. Error t value  Pr(>|t|)
(Intercept)2.251393  0.3697543  6.08889  9.568102e-09
Sepal_Width0.8035609 0.106339   7.556598 4.187317e-12
Species_versicolor 1.458743  0.1121079  13.01195 0
Species_virginica  1.946817  0.100015   19.46525 0
{code}

In my environment, it seems to work.

> Optimization of creating sparse feature without dense one
> -
>
> Key: SPARK-11439
> URL: https://issues.apache.org/jira/browse/SPARK-11439
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>Priority: Minor
>
> Currently, sparse feature generated in {{LinearDataGenerator}} needs to 
> create dense vectors once. It is cost efficient to prevent from generating 
> dense feature when creating sparse features.






[jira] [Created] (SPARK-11717) Ignore R session and history files from git

2015-11-12 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-11717:
--

 Summary: Ignore R session and history files from git
 Key: SPARK-11717
 URL: https://issues.apache.org/jira/browse/SPARK-11717
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Kai Sasaki
Priority: Trivial


SparkR generates R session data and history files under the current directory.
It would be useful to have git ignore these files, even when running SparkR from 
the Spark source directory for testing or development.
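For example, the ignore entries would presumably be something like the following (default R file names assumed):
{code}
# R session data and command history written by R when running SparkR
# from the repository root (default file names assumed).
.RData
.Rhistory
{code}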






[jira] [Created] (SPARK-11439) Optimization of creating sparse feature without dense one

2015-10-31 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-11439:
--

 Summary: Optimization of creating sparse feature without dense one
 Key: SPARK-11439
 URL: https://issues.apache.org/jira/browse/SPARK-11439
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Kai Sasaki
Priority: Minor


Currently, the sparse features generated in {{LinearDataGenerator}} need to create 
dense vectors first. It would be more cost efficient to generate the sparse 
features without creating dense vectors.
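As a rough illustration of the idea (not the actual {{LinearDataGenerator}} code; the density and values are made up), a sparse vector can be built directly from its non-zero entries instead of materialising a dense array first:
{code}
import scala.util.Random
import org.apache.spark.mllib.linalg.Vectors

val numFeatures = 1000
val rng = new Random(42)

// Generate only the non-zero entries (about 10% of the indices here) and build
// the sparse vector directly, so no dense array of size numFeatures is allocated.
val nonZeroIndices = (0 until numFeatures).filter(_ => rng.nextDouble() < 0.1).toArray
val nonZeroValues = nonZeroIndices.map(_ => rng.nextGaussian())
val sparseFeature = Vectors.sparse(numFeatures, nonZeroIndices, nonZeroValues)
{code}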






[jira] [Updated] (SPARK-11439) Optimization of creating sparse feature without dense one

2015-10-31 Thread Kai Sasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Sasaki updated SPARK-11439:
---
Description: Currently, sparse feature generated in {{LinearDataGenerator}} 
needs to create dense vectors once. It is cost efficient to prevent from 
generating dense feature when creating sparse features.  (was: Currently, 
sparse feature generated in {{LinearDataGenerator}} needs to create dense 
vectors once. It is cost efficient to prevent generating sparse feature without 
generating dense vectors.)

> Optiomization of creating sparse feature without dense one
> --
>
> Key: SPARK-11439
> URL: https://issues.apache.org/jira/browse/SPARK-11439
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>Priority: Minor
>
> Currently, sparse feature generated in {{LinearDataGenerator}} needs to 
> create dense vectors once. It is cost efficient to prevent from generating 
> dense feature when creating sparse features.






[jira] [Commented] (SPARK-11223) PySpark CrossValidatorModel does not output metrics for every param in paramGrid

2015-10-29 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981720#comment-14981720
 ] 

Kai Sasaki commented on SPARK-11223:


Yes, I think so too. But I wonder whether the purpose of this is only debugging; if 
so, only printing the parameters might be sufficient.
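For reference, on the Scala side something like the following should already pair each {{ParamMap}} with its metric (a sketch, assuming {{avgMetrics}} is available on the fitted model and that {{training}} is an existing DataFrame):
{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LinearRegression()
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(training)

// Print the cross-validated metric next to the parameters that produced it.
paramGrid.zip(cvModel.avgMetrics).foreach { case (params, metric) =>
  println(s"$params => $metric")
}
{code}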


> PySpark CrossValidatorModel does not output metrics for every param in 
> paramGrid
> 
>
> Key: SPARK-11223
> URL: https://issues.apache.org/jira/browse/SPARK-11223
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Raela Wang
>Priority: Minor
>







[jira] [Commented] (SPARK-11239) PMML export for ML linear regression

2015-10-25 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973214#comment-14973214
 ] 

Kai Sasaki commented on SPARK-11239:


[~holdenk] Hi, are the tickets under SPARK-11171 blocked by SPARK-11241?

> PMML export for ML linear regression
> 
>
> Key: SPARK-11239
> URL: https://issues.apache.org/jira/browse/SPARK-11239
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: holdenk
>
> Add PMML export for linear regression models form the ML pipeline.






[jira] [Commented] (SPARK-11234) What's cooking classification

2015-10-25 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973221#comment-14973221
 ] 

Kai Sasaki commented on SPARK-11234:


[~xusen] Thank you so much for the very insightful experiments!

{quote}
4. The evaluator forces me to select a metric method. But sometimes I want to 
see all the evaluation results, say F1, precision-recall, AUC, etc.
{quote}

Yes, I agree with you. In the initial phase of running a machine learning algorithm, 
we often don't know which metrics we should look at.

{quote}
5. ML transformers will get stuck when facing with Int type. It's strange that 
we have to transform all Int values to double values before hand. I think a 
wise auto casting is helpful.
{quote}
Which kind of Transformer got stuck? Do you mean that the first transformer cannot 
handle Int input values?
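If that is the case, the manual casting would presumably look something like this (illustrative only; {{sqlContext}} as in the shell, and the column name "age" is made up):
{code}
import org.apache.spark.sql.types.DoubleType

// Hypothetical example: "age" is read as an integer column and has to be cast
// to double by hand before it can be fed into an ML transformer.
val df = sqlContext.createDataFrame(Seq((1, 25), (2, 31))).toDF("id", "age")
val casted = df.withColumn("age", df("age").cast(DoubleType))
{code}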


> What's cooking classification
> -
>
> Key: SPARK-11234
> URL: https://issues.apache.org/jira/browse/SPARK-11234
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>
> I add the subtask to post the work on this dataset:  
> https://www.kaggle.com/c/whats-cooking






[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?

2015-10-25 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973745#comment-14973745
 ] 

Kai Sasaki commented on SPARK-7146:
---

There have been several times when I wanted to use internal pieces of Spark 
(e.g. shared params, optimization) from our own library or framework.
Having to write this code again often causes trouble and long development time. 
In addition to this, as you said, there might end up being several implementations 
which have the same name but different functionality.

{quote}
Cons:
Users have to be careful since parameters can have different meanings for 
different algorithms.
{quote}

I think this is also true even when {{sharedParams}} is private, because 
application developers will implement their own params with almost the same names 
as {{sharedParams}}. That becomes confusing.

So basically it might be better to enable developers to use {{sharedParams}} 
inside their own frameworks. That does not necessarily mean making it public 
directly. As [~josephkb] proposed in (b), a good way would be to open it up for 
developers while keeping some restrictions.
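To make the last point concrete, this is roughly what third-party code ends up writing today because the built-in traits are private (a sketch against the public {{Params}} API; the names intentionally mirror the built-in ones):
{code}
import org.apache.spark.ml.param.{IntParam, ParamValidators, Params}

// A re-implementation of a "shared" param in user code: same name as the
// built-in one, but with its own doc string and validation.
trait HasMaxIter extends Params {
  final val maxIter: IntParam =
    new IntParam(this, "maxIter", "maximum number of iterations (>= 0)", ParamValidators.gtEq(0))

  final def getMaxIter: Int = $(maxIter)
}
{code}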

> Should ML sharedParams be a public API?
> ---
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Discussion: Should the Param traits in sharedParams.scala be public?
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and 
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for 
> different algorithms.
> * If the shared Params are public, then implementations could test for the 
> traits.  It is unclear if we want users to rely on these traits, which are 
> somewhat experimental.
> Currently, the shared params are private.
> Proposal: Either
> (a) make the shared params private to encourage users to write specialized 
> documentation and value checks for parameters, or
> (b) design a better way to encourage overriding documentation and parameter 
> value checks






[jira] [Created] (SPARK-11207) Add test cases for normal LinearRegression solver as followup.

2015-10-20 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-11207:
--

 Summary: Add test cases for normal LinearRegression solver as 
followup.
 Key: SPARK-11207
 URL: https://issues.apache.org/jira/browse/SPARK-11207
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Kai Sasaki


This is the follow-up work for SPARK-10668.

* Fix minor style issues.
* Add a test case checking whether the solver is selected properly (a rough sketch below).
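A rough sketch of the kind of check meant in the second point (assuming the {{solver}} param added by SPARK-10668 and a small DataFrame {{dataset}} with "label" and "features" columns):
{code}
import org.apache.spark.ml.regression.LinearRegression

// Both solvers should converge to (almost) the same coefficients on a small,
// well-conditioned dataset with L2 regularization.
val normalModel = new LinearRegression().setSolver("normal").setRegParam(0.1).fit(dataset)
val lbfgsModel = new LinearRegression().setSolver("l-bfgs").setRegParam(0.1).fit(dataset)

assert(normalModel.coefficients.toArray.zip(lbfgsModel.coefficients.toArray)
  .forall { case (a, b) => math.abs(a - b) < 1e-3 })
{code}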






[jira] [Updated] (SPARK-11207) Add test cases for solver selection of LinearRegression as followup.

2015-10-20 Thread Kai Sasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Sasaki updated SPARK-11207:
---
Summary: Add test cases for solver selection of LinearRegression as 
followup.  (was: Add test cases for normal LinearRegression solver as followup.)

> Add test cases for solver selection of LinearRegression as followup.
> 
>
> Key: SPARK-11207
> URL: https://issues.apache.org/jira/browse/SPARK-11207
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>  Labels: ML
>
> This is the follow up work of SPARK-10668.
> * Fix miner style issues.
> * Add test case for checking whether solver is selected properly.






[jira] [Commented] (SPARK-10668) Use WeightedLeastSquares in LinearRegression with L2 regularization if the number of features is small

2015-09-23 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904395#comment-14904395
 ] 

Kai Sasaki commented on SPARK-10668:


So sorry for being late in submitting the patch, and thank you for supporting me.
[~mengxr] [~yanbo] Could you review the current patch?

> Use WeightedLeastSquares in LinearRegression with L2 regularization if the 
> number of features is small
> --
>
> Key: SPARK-10668
> URL: https://issues.apache.org/jira/browse/SPARK-10668
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Kai Sasaki
>Priority: Critical
>
> If the number of features is small (<=4096) and the regularization is L2, we 
> should use WeightedLeastSquares to solve the problem rather than L-BFGS. The 
> former requires only one pass to the data.






[jira] [Updated] (SPARK-10715) Duplicate initialization flag in WeightedLeastSquares

2015-09-19 Thread Kai Sasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Sasaki updated SPARK-10715:
---
Labels: ML  (was: )

> Duplicate initialzation flag in WeightedLeastSquare
> ---
>
> Key: SPARK-10715
> URL: https://issues.apache.org/jira/browse/SPARK-10715
> Project: Spark
>  Issue Type: Bug
>Reporter: Kai Sasaki
>Priority: Trivial
>  Labels: ML
>
> There are duplicate set of initialization flag in 
> {{WeightedLeastSquares#add}}.






[jira] [Created] (SPARK-10715) Duplicate initialization flag in WeightedLeastSquares

2015-09-19 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-10715:
--

 Summary: Duplicate initialization flag in WeightedLeastSquares
 Key: SPARK-10715
 URL: https://issues.apache.org/jira/browse/SPARK-10715
 Project: Spark
  Issue Type: Bug
Reporter: Kai Sasaki
Priority: Trivial


There is a duplicate setting of the initialization flag in {{WeightedLeastSquares#add}}.






[jira] [Commented] (SPARK-10709) When loading a json dataset as a data frame, if the input path is wrong, the error message is very confusing

2015-09-18 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876767#comment-14876767
 ] 

Kai Sasaki commented on SPARK-10709:


[~yhuai] So do you mean the error message should distinguish between the case where 
the path does not exist and the case where no path is passed via the {{path}} 
parameter?

> When loading a json dataset as a data frame, if the input path is wrong, the 
> error message is very confusing
> 
>
> Key: SPARK-10709
> URL: https://issues.apache.org/jira/browse/SPARK-10709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> If you do something like {{sqlContext.read.json("a wrong path")}}, when we 
> actually read data, the error message is 
> {code}
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:198)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.ShuffleDependency.(Dependency.scala:91)
>   at 
> org.apache.spark.sql.execution.ShuffledRowRDD.getDependencies(ShuffledRowRDD.scala:59)
>   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:226)
>   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:224)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.dependencies(RDD.scala:224)
>   at 
> 

[jira] [Commented] (SPARK-10668) Use WeightedLeastSquares in LinearRegression with L2 regularization if the number of features is small

2015-09-18 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876787#comment-14876787
 ] 

Kai Sasaki commented on SPARK-10668:


[~mengxr] Hello, can I work on this JIRA? Please assign it to me. Thank you.

> Use WeightedLeastSquares in LinearRegression with L2 regularization if the 
> number of features is small
> --
>
> Key: SPARK-10668
> URL: https://issues.apache.org/jira/browse/SPARK-10668
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> If the number of features is small (<=4096) and the regularization is L2, we 
> should use WeightedLeastSquares to solve the problem rather than L-BFGS. The 
> former requires only one pass to the data.






[jira] [Commented] (SPARK-10388) Public dataset loader interface

2015-09-17 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14802747#comment-14802747
 ] 

Kai Sasaki commented on SPARK-10388:


[~mengxr] I totally agree with you. The initial version should be minimal and 
simple, so my previous suggestion just lists desired features. In that sense, the 
initial proposal might be sufficient as an MVP.
{quote}
For example, I don't think json and orc are commonly used for ML datasets.
{quote}
Yes, json and orc are not commonly used for machine learning data. I just think the 
public dataset loader should stay flexible for later extension, meaning other 
dataset formats could be added as plugins.
{quote}
A proper implementation would be implementing HTTP as a Hadoop FileSystem.
{quote}
Does that mean a public dataset could be used through an RDD directly? For example, 
could we use {{val data = sc.textFile( // public dataset url )}}?

> Public dataset loader interface
> ---
>
> Key: SPARK-10388
> URL: https://issues.apache.org/jira/browse/SPARK-10388
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> It is very useful to have a public dataset loader to fetch ML datasets from 
> popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, 
> requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> User should be able to list (or preview) datasets, e.g.
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the 
> API and implementation are pending discussion. Note that this requires http 
> and https support.






[jira] [Commented] (SPARK-10388) Public dataset loader interface

2015-09-15 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745430#comment-14745430
 ] 

Kai Sasaki commented on SPARK-10388:


It seems very useful for beginners who want to try Spark ML in their projects and 
who want to see the behaviour of the Pipeline API. I have several comments.

* It might be better to do lazy downloads. Some datasets are very large, so it 
would be good to download them only when they are really needed. In the example 
above, the datasets would be downloaded at {{datasets.show()}}.
* Once datasets are downloaded, it would be better to cache the data locally. That 
requires a repository API for publishing the latest updates, so that the public 
dataset loader can refresh its local cache properly.
* I agree with the idea of allowing 3rd parties to create their own repositories. 
That requires fixing the design of the repository itself. We could create a 
specification and also an SDK if possible. (Should these be included in the Spark 
project?)
* We should not restrict the formats the public dataset loader can load. The 
current {{DataFrameReader}} can already read formats such as json, libsvm, or orc, 
but there may be various other formats in public repositories, so it may be 
reasonable to support currently unsupported formats in the future.
* Although this is a bit of a whim, integration between the public dataset loader 
and Kaggle datasets would increase the use cases of Spark ML.

In general, searching for data and loading it is troublesome, and this feature 
makes that easier for developers. I would like to help with the design and 
implementation. Thank you.

> Public dataset loader interface
> ---
>
> Key: SPARK-10388
> URL: https://issues.apache.org/jira/browse/SPARK-10388
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> It is very useful to have a public dataset loader to fetch ML datasets from 
> popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, 
> requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> User should be able to list (or preview) datasets, e.g.
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the 
> API and implementation are pending discussion. Note that this requires http 
> and https support.






[jira] [Commented] (SPARK-10055) San Francisco Crime Classification

2015-08-25 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14711298#comment-14711298
 ] 

Kai Sasaki commented on SPARK-10055:


I submitted an initial version for this competition. Although the score is not 
good, there are several points I found while using the Spark ML API. Some of them 
may just be caused by my lack of knowledge of Spark ML, so if they can already be 
solved with existing code, please let me know.

* There does not seem to be a {{Transformer}} that can cast the type of a column. 
In this case, {{X}} and {{Y}} are String by default when read by 
[spark-csv|http://spark-packages.org/package/databricks/spark-csv]. In order to 
apply {{StandardScaler}} to {{X}} and {{Y}}, they must be numeric types, and I 
cannot do that cast with a Spark ML {{Transformer}}. Fortunately, {{spark-csv}} can 
infer the schema types by reading all the data once, but when the reading library 
has no such option, I think it would be better to be able to cast column types 
inside the Spark ML pipeline.
  
* {{StringIndexer}} exports its labels ordered by frequency, but in this 
competition we have to write them in alphabetical order, so some extra code is 
needed to convert the frequency-ordered labels to alphabetical order.
  
* {{StandardScaler}} can only receive vector data as input. In this case, I want to 
scale {{X}} and {{Y}} with {{StandardScaler}}, but these are plain double columns, 
so it is necessary to assemble these values into a feature vector first (a rough 
sketch follows below). Is there a way to use {{StandardScaler}} on plain Int or 
Double data, or do we always have to assemble the data into a feature vector before 
scaling?
  
The code is 
[here|https://github.com/Lewuathe/kaggle-jobs/blob/master/src/main/scala/com/lewuathe/SfCrimeClassification.scala].
 Thank you.
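For reference, the kind of workaround described in the last two points looks roughly like this (a sketch, not the linked code; {{raw}} is assumed to be the DataFrame read by spark-csv, and the column names follow the competition data):
{code}
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.sql.types.DoubleType

// Cast the string columns produced by spark-csv to double by hand.
val withDoubles = raw
  .withColumn("X", raw("X").cast(DoubleType))
  .withColumn("Y", raw("Y").cast(DoubleType))

// StandardScaler only accepts a vector column, so assemble X and Y first.
val assembler = new VectorAssembler()
  .setInputCols(Array("X", "Y"))
  .setOutputCol("coords")
val assembled = assembler.transform(withDoubles)

val scaler = new StandardScaler()
  .setInputCol("coords")
  .setOutputCol("scaledCoords")
val scaled = scaler.fit(assembled).transform(assembled)
{code}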


 San Francisco Crime Classification
 --

 Key: SPARK-10055
 URL: https://issues.apache.org/jira/browse/SPARK-10055
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng
Assignee: Xusen Yin

 Apply ML pipeline API to San Francisco Crime Classification 
 (https://www.kaggle.com/c/sf-crime).






[jira] [Commented] (SPARK-10117) Implement SQL data source API for reading LIBSVM data

2015-08-25 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14711186#comment-14711186
 ] 

Kai Sasaki commented on SPARK-10117:


[~mengxr] If possible, can I work on this JIRA? Thank you!

 Implement SQL data source API for reading LIBSVM data
 -

 Key: SPARK-10117
 URL: https://issues.apache.org/jira/browse/SPARK-10117
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Xiangrui Meng

 It is convenient to implement data source API for LIBSVM format to have a 
 better integration with DataFrames and ML pipeline API.
 {code}
 import org.apache.spark.ml.source.libsvm._
 val training = sqlContext.read
   .format("libsvm")
   .option("numFeatures", "1")
   .load(path)
 {code}
 This JIRA covers the following:
 1. Read LIBSVM data as a DataFrame with two columns: label: Double and 
 features: Vector.
 2. Accept `numFeatures` as an option.
 3. The implementation should live under `org.apache.spark.ml.source.libsvm`.






[jira] [Created] (SPARK-10110) StringIndexer lacks parameter handleInvalid.

2015-08-19 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-10110:
--

 Summary: StringIndexer lacks parameter handleInvalid.
 Key: SPARK-10110
 URL: https://issues.apache.org/jira/browse/SPARK-10110
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Kai Sasaki
 Fix For: 1.5.0


Missing API for pyspark {{StringIndexer.handleInvalid}}.






[jira] [Created] (SPARK-10111) StringIndexerModel lacks method labels

2015-08-19 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-10111:
--

 Summary: StringIndexerModel lacks method labels
 Key: SPARK-10111
 URL: https://issues.apache.org/jira/browse/SPARK-10111
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Kai Sasaki


Missing {{labels}} property of {{StringIndexer}} in pyspark.






[jira] [Updated] (SPARK-10111) StringIndexerModel lacks method labels

2015-08-19 Thread Kai Sasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Sasaki updated SPARK-10111:
---
Description: Missing {{labels}} property of {{StringIndexerModel}} in 
pyspark.  (was: Missing {{labels}} property of {{StringIndexer}} in pyspark.)

 StringIndexerModel lacks method labels
 ---

 Key: SPARK-10111
 URL: https://issues.apache.org/jira/browse/SPARK-10111
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Kai Sasaki

 Missing {{labels}} property of {{StringIndexerModel}} in pyspark.






[jira] [Commented] (SPARK-10027) Add Python API missing methods for ml.feature

2015-08-19 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14702861#comment-14702861
 ] 

Kai Sasaki commented on SPARK-10027:


[~yanbo] Can I work on this JIRA? Thank you.

 Add Python API missing methods for ml.feature
 -

 Key: SPARK-10027
 URL: https://issues.apache.org/jira/browse/SPARK-10027
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Yanbo Liang

 Missing method of ml.feature are listed here:
 * StringIndexer lacks parameter handleInvalid.
 * StringIndexerModel lacks method labels.
 * VectorIndexerModel lacks methods numFeatures and categoryMaps






[jira] [Updated] (SPARK-10012) Missing test case for Params#arrayLengthGt

2015-08-15 Thread Kai Sasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Sasaki updated SPARK-10012:
---
Summary: Missing test case for Params#arrayLengthGt  (was: Missing test 
case for ParamsarrayLengthGt)

 Missing test case for Params#arrayLengthGt
 --

 Key: SPARK-10012
 URL: https://issues.apache.org/jira/browse/SPARK-10012
 Project: Spark
  Issue Type: Test
  Components: ML, Tests
Affects Versions: 1.5.0
Reporter: Kai Sasaki
Priority: Trivial

 Currently there is no test case for {{Params#arrayLengthGt}}. 






[jira] [Created] (SPARK-10012) Missing test case for ParamsarrayLengthGt

2015-08-15 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-10012:
--

 Summary: Missing test case for ParamsarrayLengthGt
 Key: SPARK-10012
 URL: https://issues.apache.org/jira/browse/SPARK-10012
 Project: Spark
  Issue Type: Test
  Components: ML, Tests
Affects Versions: 1.5.0
Reporter: Kai Sasaki
Priority: Trivial


Currently there is no test case for {{Params#arrayLengthGt}}. 






[jira] [Commented] (SPARK-10009) PySpark Param of Vector type can be set with Python array or numpy.array

2015-08-15 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698515#comment-14698515
 ] 

Kai Sasaki commented on SPARK-10009:


[~yanbo] Currently, ML model parameters look like they must be set as a keyword 
dictionary. Specifically, which parameters can be set as vector types?

 PySpark Param of Vector type can be set with Python array or numpy.array
 

 Key: SPARK-10009
 URL: https://issues.apache.org/jira/browse/SPARK-10009
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang

 If the type of Param in PySpark ML pipeline is Vector, we can set with Vector 
 currently. We also need to support set it with Python array and numpy.array. 
 It should be handled in the wrapper (_transfer_params_to_java).






[jira] [Commented] (SPARK-10002) SSH problem during Setup of Spark(1.3.0) cluster on EC2

2015-08-15 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698516#comment-14698516
 ] 

Kai Sasaki commented on SPARK-10002:


[~deepalib] Can you log in via SSH, not just ping? Ping does not confirm the state of 
the SSH port, so this might be a security group problem.

 SSH problem during Setup of Spark(1.3.0) cluster on EC2
 ---

 Key: SPARK-10002
 URL: https://issues.apache.org/jira/browse/SPARK-10002
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.3.0
 Environment: EC2, SPARK 1.3.0 cluster setup in vpc/subnet.
Reporter: Deepali Bhandari

 Steps to start a Spark cluster with EC2 scripts
 1. I created an EC2 instance in the VPC and subnet. Amazon Linux.
 2. I downloaded spark-1.3.0
 3. chmod 400 key file
 4. Export aws access and secret keys
 5. Now ran the command
  ./spark-ec2 --key-pair=deepali-ec2-keypair 
 --identity-file=/home/ec2-user/Spark/deepali-ec2-keypair.pem 
 --region=us-west-2 --zone=us-west-2b --vpc-id=vpc-03d67b66 
 --subnet-id=subnet-72fd5905 --resume launch deepali-spark-nodocker
  6. The master and slave instances are created, but SSH fails saying the host could 
 not be resolved.
  7. I can ping the master and slave, and I can ssh from the command line, but not 
 from the ec2 scripts. 
  8. I have spent more than 2 days now, but no luck yet.
  9. The ec2 scripts don't work; the code has a bug in referencing the cluster 
 nodes via the wrong hostnames.
  
 SCREEN CONSOLE log
  ./spark-ec2 --key-pair=deepali-ec2-keypair --identity-file=/home/ec2-user/Spark/deepali-ec2-keypair.pem --region=us-west-2 --zone=us-west-2b --vpc-id=vpc-03d67b66 --subnet-id=subnet-72fd5905 launch deepali-spark-nodocker
 Downloading Boto from PyPi
 Finished downloading Boto
 Setting up security groups...
 Creating security group deepali-spark-nodocker-master
 Creating security group deepali-spark-nodocker-slaves
 Searching for existing cluster deepali-spark-nodocker...
 Spark AMI: ami-9a6e0daa
 Launching instances...
 Launched 1 slaves in us-west-2b, regid = r-0d2088fb
 Launched master in us-west-2b, regid = r-312088c7
 Waiting for AWS to propagate instance metadata...
 Waiting for cluster to enter 'ssh-ready' state...
 Warning: SSH connection error. (This could be temporary.)
 Host: None
 SSH return code: 255
 SSH output: ssh: Could not resolve hostname None: Name or service not known



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9977) The usage of a label generated by StringIndexer

2015-08-14 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-9977:
-

 Summary: The usage of a label generated by StringIndexer
 Key: SPARK-9977
 URL: https://issues.apache.org/jira/browse/SPARK-9977
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.4.1
Reporter: Kai Sasaki
Priority: Trivial


By using {{StringIndexer}}, we can obtain an indexed label in a new column. A 
following estimator should therefore use this new column through the pipeline if it 
wants to use the string-indexed label. 
I think it is better to make this explicit in the documentation, for example with a 
short pipeline snippet like the one below.
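For example (a minimal PySpark sketch; column names are illustrative and {{training}} is an existing DataFrame):
{code}
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import LogisticRegression

# The downstream estimator must point at the indexed column produced by
# StringIndexer, not at the original string label column.
indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
lr = LogisticRegression(featuresCol="features", labelCol="indexedLabel")
model = Pipeline(stages=[indexer, lr]).fit(training)
{code}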



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9841) Params.clear needs to be public

2015-08-13 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14695103#comment-14695103
 ] 

Kai Sasaki commented on SPARK-9841:
---

[~josephkb]
Do you have a use case where a public clear method is useful? I think the parameters 
will have been set through training, and when we want to reset a parameter and create 
the model again, it should only be necessary to train the estimator again. 
If there are cases where a public clear method is needed, the {{set}} method 
should also be public, I think. Is that correct?

 Params.clear needs to be public
 ---

 Key: SPARK-9841
 URL: https://issues.apache.org/jira/browse/SPARK-9841
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley

 It is currently impossible to clear Param values once set.  It would be 
 helpful to be able to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9073) spark.ml Models copy() should call setParent when there is a parent

2015-07-16 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629527#comment-14629527
 ] 

Kai Sasaki commented on SPARK-9073:
---

[~josephkb] Hi, if possible, can I work on this JIRA? Thank you.

 spark.ml Models copy() should call setParent when there is a parent
 ---

 Key: SPARK-9073
 URL: https://issues.apache.org/jira/browse/SPARK-9073
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.5.0
Reporter: Joseph K. Bradley
Priority: Minor

 Examples with this mistake include:
 * 
 [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala#L119]
 * 
 [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L220]
 Whoever writes a PR for this JIRA should check all spark.ml Models' copy() 
 methods and set the copy's {{Model.parent}} when available.  Also verify this in unit 
 tests (possibly in a standard method that checks Models, to share code).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9064) Job fail due to timeout with spark-packages

2015-07-15 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-9064:
-

 Summary: Job fail due to timeout with spark-packages
 Key: SPARK-9064
 URL: https://issues.apache.org/jira/browse/SPARK-9064
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Spark Submit
Affects Versions: 1.5.0
Reporter: Kai Sasaki


With spark-packages (any package), the Spark job fails due to a timeout. Without 
spark-packages, all jobs work.

{code}
$ ./bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3

scala> import org.apache.spark.mllib.util._
import org.apache.spark.mllib.util._

scala> sc.textFile("README.md").count
[Stage 0:  (0 + 2) / 
2]15/07/15 15:58:09 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.init(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at 
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
at 
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:652)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:466)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:398)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:390)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

{code}

All error logs are attached. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9064) Job fail due to timeout with spark-packages

2015-07-15 Thread Kai Sasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Sasaki updated SPARK-9064:
--
Description: 
With spark-packages (any package), the Spark job fails due to a timeout. Without 
spark-packages, all jobs work.

{code}
$ ./bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3

scala> import org.apache.spark.mllib.util._
import org.apache.spark.mllib.util._

scala> sc.textFile("README.md").count
[Stage 0:  (0 + 2) / 
2]15/07/15 15:58:09 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.init(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at 
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
at 
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:652)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:466)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:398)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:390)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

{code}

All error logs are attached. The environment is Mac OS X 10.10.4. Spark was built 
from the master branch and the target was hadoop-2.6.

  was:
With spark-packages(Any packages), spark job fails due to timeout. Without 
spark-packages any jobs are working.

{code}
$ ./bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3

scala> import org.apache.spark.mllib.util._
import org.apache.spark.mllib.util._

scala> sc.textFile("README.md").count
[Stage 0:  (0 + 2) / 
2]15/07/15 15:58:09 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.init(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at 
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
at 

[jira] [Updated] (SPARK-9064) Job fail due to timeout with spark-packages

2015-07-15 Thread Kai Sasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Sasaki updated SPARK-9064:
--
Attachment: error_logs.txt

 Job fail due to timeout with spark-packages
 ---

 Key: SPARK-9064
 URL: https://issues.apache.org/jira/browse/SPARK-9064
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Spark Submit
Affects Versions: 1.5.0
Reporter: Kai Sasaki
  Labels: package
 Attachments: error_logs.txt


 With spark-packages (any package), the Spark job fails due to a timeout. Without 
 spark-packages, all jobs work.
 {code}
 $ ./bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3
 scala> import org.apache.spark.mllib.util._
 import org.apache.spark.mllib.util._
 scala> sc.textFile("README.md").count
 [Stage 0:  (0 + 2) / 
 2]15/07/15 15:58:09 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
 java.net.SocketTimeoutException: connect timed out
 at java.net.PlainSocketImpl.socketConnect(Native Method)
 at 
 java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
 at 
 java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
 at 
 java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
 at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
 at java.net.Socket.connect(Socket.java:579)
 at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
 at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
 at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
 at sun.net.www.http.HttpClient.init(HttpClient.java:211)
 at sun.net.www.http.HttpClient.New(HttpClient.java:308)
 at sun.net.www.http.HttpClient.New(HttpClient.java:326)
 at 
 sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
 at 
 sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
 at 
 sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
 at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:652)
 at org.apache.spark.util.Utils$.fetchFile(Utils.scala:466)
 at 
 org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:398)
 at 
 org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:390)
 at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
 at 
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
 at 
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
 at 
 scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
 at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
 at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
 at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
 at 
 org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:390)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 {code}
 All error logs are attached. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8939) YARN EC2 default setting fails with IllegalArgumentException

2015-07-09 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14620534#comment-14620534
 ] 

Kai Sasaki commented on SPARK-8939:
---

Should {{--num-executors}} also default to 2 as the number of executors on 
YARN?
{code}
--num-executors NUM Number of executors to launch (Default: 2).
{code}
If possible, can I work on this JIRA?

 YARN EC2 default setting fails with IllegalArgumentException
 

 Key: SPARK-8939
 URL: https://issues.apache.org/jira/browse/SPARK-8939
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.5.0
Reporter: Andrew Or

 I just set it up from scratch using the spark-ec2 script. Then I ran
 {code}
 bin/spark-shell --master yarn
 {code}
 which failed with
 {code}
 15/07/09 03:44:29 ERROR SparkContext: Error initializing SparkContext.
 java.lang.IllegalArgumentException: Unknown/unsupported param 
 List(--num-executors, , --executor-memory, 6154m, --executor-memory, 6154m, 
 --executor-cores, 2, --name, Spark shell)
 {code}
 This goes away if I provide `--num-executors`, but we should fix the default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method

2015-07-01 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14610018#comment-14610018
 ] 

Kai Sasaki commented on SPARK-1503:
---

[~staple] [~josephkb] Thank you for the ping and the inspiring information! I'll 
rewrite the current patch based on your logic and code. Thanks a lot.


 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple
 Attachments: linear.png, linear_l1.png, logistic.png, logistic_l2.png


 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8419) Statistics.colStats could avoid an extra count()

2015-06-21 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595048#comment-14595048
 ] 

Kai Sasaki commented on SPARK-8419:
---

In {{Statistics#colStats}}, the number of rows seems to be updated in 
{{computeColumnSummaryStatistics}} via {{updateNumRows}}. This is computed in a 
distributed fashion inside {{RDD#treeAggregate}}. So I think there is no extra 
{{count()}} when only creating a {{RowMatrix}}. Is this assumption correct?
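For reference, the user-facing call only triggers that single aggregation pass, and the row count comes back as part of the same summary (a small PySpark illustration, assuming {{rows}} is an RDD of Vectors):
{code}
from pyspark.mllib.stat import Statistics

summary = Statistics.colStats(rows)   # one treeAggregate pass over the data
print(summary.count())                # row count maintained by the summarizer
print(summary.mean())
print(summary.variance())
{code}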

 Statistics.colStats could avoid an extra count()
 

 Key: SPARK-8419
 URL: https://issues.apache.org/jira/browse/SPARK-8419
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Trivial
  Labels: starter

 Statistics.colStats goes through RowMatrix to compute the stats.  But 
 RowMatrix.computeColumnSummaryStatistics does an extra count() which could be 
 avoided.  Not going through RowMatrix would skip this extra pass over the 
 data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6263) Python MLlib API missing items: Utils

2015-04-11 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491060#comment-14491060
 ] 

Kai Sasaki commented on SPARK-6263:
---

Can I work on this JIRA? Please assign it to me. Thank you.

 Python MLlib API missing items: Utils
 -

 Key: SPARK-6263
 URL: https://issues.apache.org/jira/browse/SPARK-6263
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 MLUtils
 * appendBias
 * kFold
 * loadVectors
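 As a rough sketch, these could surface in pyspark.mllib.util roughly as follows (names mirror the Scala API and are assumptions here; the Python shape of kFold in particular is an open question):
 {code}
 from pyspark.mllib.util import MLUtils
 from pyspark.mllib.linalg import Vectors

 v = Vectors.dense([1.0, 2.0])
 vb = MLUtils.appendBias(v)                       # expected: [1.0, 2.0, 1.0]
 vecs = MLUtils.loadVectors(sc, "data/vectors")   # expected: RDD of Vectors
 # kFold(rdd, numFolds, seed) would return (training, validation) RDD pairs.
 {code}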



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-6720:
-

 Summary: PySpark MultivariateStatisticalSummary unit test for 
normL1 and normL2
 Key: SPARK-6720
 URL: https://issues.apache.org/jira/browse/SPARK-6720
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Kai Sasaki
Priority: Minor
 Fix For: 1.4.0


Implement correct normL1 and normL2 test.

continuation: https://github.com/apache/spark/pull/5359
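A minimal sketch of the kind of check such a test could make (the expected values below are worked out by hand for a two-row example):
{code}
from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import Vectors

rows = sc.parallelize([Vectors.dense([1.0, -2.0]), Vectors.dense([3.0, 4.0])])
summary = Statistics.colStats(rows)
# Column-wise sum of absolute values: [4.0, 6.0]
print(summary.normL1())
# Column-wise Euclidean norm: [sqrt(10), sqrt(20)]
print(summary.normL2())
{code}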



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-04-04 Thread Kai Sasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Sasaki updated SPARK-4036:
--
Attachment: CRF_design.1.pdf

 Add Conditional Random Fields (CRF) algorithm to Spark MLlib
 

 Key: SPARK-4036
 URL: https://issues.apache.org/jira/browse/SPARK-4036
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Guoqiang Li
Assignee: Kai Sasaki
 Attachments: CRF_design.1.pdf


 Conditional random fields (CRFs) are a class of statistical modelling method 
 often applied in pattern recognition and machine learning, where they are 
 used for structured prediction. 
 The paper: 
 http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6643) Python API for StandardScalerModel

2015-03-31 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-6643:
-

 Summary: Python API for StandardScalerModel
 Key: SPARK-6643
 URL: https://issues.apache.org/jira/browse/SPARK-6643
 Project: Spark
  Issue Type: Task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Kai Sasaki
Priority: Minor
 Fix For: 1.4.0


This is a sub-task of SPARK-6254.
Wrap the missing methods for {{StandardScalerModel}}.
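A sketch of the Python surface once the missing members are wrapped (attribute names mirror Scala's StandardScalerModel and are assumptions until the PR is merged; {{vectors_rdd}} is an existing RDD of Vectors):
{code}
from pyspark.mllib.feature import StandardScaler

scaler = StandardScaler(withMean=True, withStd=True)
model = scaler.fit(vectors_rdd)
scaled = model.transform(vectors_rdd)
# Members expected after wrapping, mirroring the Scala model:
print(model.mean, model.std)
{code}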



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-03-30 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386636#comment-14386636
 ] 

Kai Sasaki commented on SPARK-4036:
---

[~mengxr] I wrote a design doc based on your advice. Thank you.

 Add Conditional Random Fields (CRF) algorithm to Spark MLlib
 

 Key: SPARK-4036
 URL: https://issues.apache.org/jira/browse/SPARK-4036
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Guoqiang Li
Assignee: Kai Sasaki

 Conditional random fields (CRFs) are a class of statistical modelling method 
 often applied in pattern recognition and machine learning, where they are 
 used for structured prediction. 
 The paper: 
 http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6615) Python API for Word2Vec

2015-03-30 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-6615:
-

 Summary: Python API for Word2Vec
 Key: SPARK-6615
 URL: https://issues.apache.org/jira/browse/SPARK-6615
 Project: Spark
  Issue Type: Task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Kai Sasaki
Priority: Minor
 Fix For: 1.4.0


This is the sub-task of 
[SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254].

Wrap the missing methods for {{Word2Vec}} and {{Word2VecModel}}.
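A sketch of the wrapped calls ({{setMinCount}} and {{getVectors}} follow the Scala API and are assumptions for the Python side until merged):
{code}
from pyspark.mllib.feature import Word2Vec

corpus = sc.textFile("text8").map(lambda line: line.split(" "))
model = Word2Vec().setVectorSize(50).setMinCount(10).fit(corpus)
print(model.findSynonyms("spark", 5))
vectors = model.getVectors()   # word -> vector mapping, once exposed
{code}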



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6598) Python API for IDFModel

2015-03-29 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-6598:
-

 Summary: Python API for IDFModel
 Key: SPARK-6598
 URL: https://issues.apache.org/jira/browse/SPARK-6598
 Project: Spark
  Issue Type: Task
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Kai Sasaki
Priority: Minor


This is the sub-task of 
[SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254].

Wrap the IDFModel {{idf}} member function for PySpark.
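A sketch of the wrapped member (the {{idf}} accessor mirrors the Scala side and is an assumption until merged; {{documents}} is an existing RDD of token lists):
{code}
from pyspark.mllib.feature import HashingTF, IDF

tf = HashingTF().transform(documents)
tf.cache()
idf_model = IDF().fit(tf)
tfidf = idf_model.transform(tf)
print(idf_model.idf())   # the IDF vector itself, once exposed
{code}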



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6261) Python MLlib API missing items: Feature

2015-03-29 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386036#comment-14386036
 ] 

Kai Sasaki commented on SPARK-6261:
---

[~josephkb] I created JIRA for IDFModel here. 
[SPARK-6598|https://issues.apache.org/jira/browse/SPARK-6598]. Thank you!

 Python MLlib API missing items: Feature
 ---

 Key: SPARK-6261
 URL: https://issues.apache.org/jira/browse/SPARK-6261
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 StandardScalerModel
 * All functionality except predict() is missing.
 IDFModel
 * idf
 Word2Vec
 * setMinCount
 Word2VecModel
 * getVectors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6261) Python MLlib API missing items: Feature

2015-03-25 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381039#comment-14381039
 ] 

Kai Sasaki commented on SPARK-6261:
---

[~josephkb] Can I work on this JIRA? I also have a question: 
{{StandardScalerModel}} seems to have no method named {{predict()}}, correct? 
Are we supposed to wrap the other methods implemented in {{StandardScalerModel}}?

 Python MLlib API missing items: Feature
 ---

 Key: SPARK-6261
 URL: https://issues.apache.org/jira/browse/SPARK-6261
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 StandardScalerModel
 * All functionality except predict() is missing.
 IDFModel
 * idf
 Word2Vec
 * setMinCount
 Word2VecModel
 * getVectors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-03-15 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362295#comment-14362295
 ] 

Kai Sasaki commented on SPARK-4036:
---

[~mengxr] I'm thinking about the design of CRF and have a question. The gradient 
descent currently implemented in MLlib should be usable in CRF, but the current 
{{Optimizer}} can only receive {{RDD\[(Double, Vector)\]}}. A general CRF should 
accept various types of labels and optimize over them. Is there any plan to extend 
{{Optimizer}} so it can optimize non-double labels (such as strings)? Or do you 
have any other idea for training on non-double labels with the current {{Optimizer}}?
Thank you.

 Add Conditional Random Fields (CRF) algorithm to Spark MLlib
 

 Key: SPARK-4036
 URL: https://issues.apache.org/jira/browse/SPARK-4036
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Guoqiang Li
Assignee: Kai Sasaki

 Conditional random fields (CRFs) are a class of statistical modelling method 
 often applied in pattern recognition and machine learning, where they are 
 used for structured prediction. 
 The paper: 
 http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6336) LBFGS should document what convergenceTol means

2015-03-15 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362347#comment-14362347
 ] 

Kai Sasaki commented on SPARK-6336:
---

I created this patch, but a unit test has already been written for the convergence 
tolerance. Is 
[this|https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/optimization/LBFGSSuite.scala#L138-L195]
 the correct one?


 LBFGS should document what convergenceTol means
 ---

 Key: SPARK-6336
 URL: https://issues.apache.org/jira/browse/SPARK-6336
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Trivial

 LBFGS uses Breeze's LBFGS, which uses relative convergence tolerance.  We 
 should document that convergenceTol is relative and ensure in a unit test 
 that this behavior does not change in Breeze without us realizing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5261) In some cases ,The value of word's vector representation is too big

2015-01-26 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292750#comment-14292750
 ] 

Kai Sasaki commented on SPARK-5261:
---

[~gq] Can you provide us with the data set? I tried several numbers of partitions 
but could not reproduce it. 

 In some cases ,The value of word's vector representation is too big
 ---

 Key: SPARK-5261
 URL: https://issues.apache.org/jira/browse/SPARK-5261
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Guoqiang Li

 {code}
 val word2Vec = new Word2Vec()
 word2Vec.
   setVectorSize(100).
   setSeed(42L).
   setNumIterations(5).
   setNumPartitions(36)
 {code}
 The average absolute value of the word's vector representation is 60731.8
 {code}
 val word2Vec = new Word2Vec()
 word2Vec.
   setVectorSize(100).
   setSeed(42L).
   setNumIterations(5).
   setNumPartitions(1)
 {code}
 The average  absolute value of the word's vector representation is 0.13889



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5119) java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model

2015-01-08 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270316#comment-14270316
 ] 

Kai Sasaki commented on SPARK-5119:
---

I think the impurity implementations in MLlib cannot handle negative labels; in this 
case the label is -1.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala#L93
Should impurity support negative labels?
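Until that is decided, a workaround on the user side is to remap the labels into {0, 1} before training (a PySpark sketch based on the report below; the a1a data uses labels in {-1, +1}):
{code}
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

data = MLUtils.loadLibSVMFile(sc, "a1a")
# Tree impurity aggregators index class counts by the label value,
# so remap {-1, +1} to {0.0, 1.0} first.
remapped = data.map(lambda lp: LabeledPoint(0.0 if lp.label < 0 else 1.0,
                                            lp.features))
model = DecisionTree.trainClassifier(remapped, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     maxDepth=5, maxBins=100)
{code}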

 java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree 
 model
 ---

 Key: SPARK-5119
 URL: https://issues.apache.org/jira/browse/SPARK-5119
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 1.1.0, 1.2.0
 Environment: Linux ubuntu 14.04
Reporter: Vivek Kulkarni

 First I tried to see if a bug with a similar trace had been raised before. I 
 found https://www.mail-archive.com/user@spark.apache.org/msg13708.html but 
 the suggestion to upgrade to the latest code base (I cloned from the master branch) 
 does not fix this issue.
 Issue: try to train a decision tree classifier on some data. After training, 
 when it begins to collect, it crashes:
 15/01/06 22:28:15 INFO BlockManagerMaster: Updated info of block rdd_52_1
 15/01/06 22:28:15 ERROR Executor: Exception in task 1.0 in stage 31.0 (TID 
 1895)
 java.lang.ArrayIndexOutOfBoundsException: -1
 at 
 org.apache.spark.mllib.tree.impurity.GiniAggregator.update(Gini.scala:93)
 at 
 org.apache.spark.mllib.tree.impl.DTStatsAggregator.update(DTStatsAggregator.scala:100)
 at 
 org.apache.spark.mllib.tree.DecisionTree$.orderedBinSeqOp(DecisionTree.scala:419)
 at 
 org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$nodeBinSeqOp$1(DecisionTree.scala:511)
 at 
 org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:536
 )
 at 
 org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:533
 )
 at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
 at 
 org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:533)
 at 
 org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628)
 at 
 org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 Minimal code:
  data = MLUtils.loadLibSVMFile(sc, 
 '/scratch1/vivek/datasets/private/a1a').cache()
 model = DecisionTree.trainClassifier(data, numClasses=2, 
 categoricalFeaturesInfo={}, maxDepth=5, maxBins=100)
 Just download the data from: 
 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4284) BinaryClassificationMetrics precision-recall method names should correspond to return types

2015-01-07 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267450#comment-14267450
 ] 

Kai Sasaki commented on SPARK-4284:
---

I'd like to work on this issue if it has not been fixed yet. Could you assign it to 
me? 

 BinaryClassificationMetrics precision-recall method names should correspond 
 to return types
 ---

 Key: SPARK-4284
 URL: https://issues.apache.org/jira/browse/SPARK-4284
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 BinaryClassificationMetrics has several methods which work with (recall, 
 precision) pairs, but the method names all use the wrong order (pr).  This 
 order should be fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4284) BinaryClassificationMetrics precision-recall method names should correspond to return types

2015-01-07 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267535#comment-14267535
 ] 

Kai Sasaki commented on SPARK-4284:
---

[~srowen] It's very helpful advice.  Thank you!

 BinaryClassificationMetrics precision-recall method names should correspond 
 to return types
 ---

 Key: SPARK-4284
 URL: https://issues.apache.org/jira/browse/SPARK-4284
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 BinaryClassificationMetrics has several methods which work with (recall, 
 precision) pairs, but the method names all use the wrong order (pr).  This 
 order should be fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian

2015-01-06 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266986#comment-14266986
 ] 

Kai Sasaki commented on SPARK-5019:
---

I'm sorry for submitting a premature PR. Is it OK to ask someone to assign the 
tickets I want to take from next time? I don't seem to have the rights to assign 
issues to myself. 

I want to check SPARK-5018 and review it. Sorry for disturbing you, [~tgaloppo].


 Update GMM API to use MultivariateGaussian
 --

 Key: SPARK-5019
 URL: https://issues.apache.org/jira/browse/SPARK-5019
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Blocker

 The GaussianMixtureModel API should expose MultivariateGaussian instances 
 instead of the means and covariances.  This should be fixed as soon as 
 possible to stabilize the API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5073) spark.storage.memoryMapThreshold have two default value

2015-01-05 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264558#comment-14264558
 ] 

Kai Sasaki commented on SPARK-5073:
---

I did not notice the above comment. Sorry, I've just created a PR for this issue.

 spark.storage.memoryMapThreshold have two default value
 -

 Key: SPARK-5073
 URL: https://issues.apache.org/jira/browse/SPARK-5073
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Yuan Jianhui
Priority: Minor

 In org.apache.spark.storage.DiskStore:
  val minMemoryMapBytes = blockManager.conf.getLong("spark.storage.memoryMapThreshold", 2 * 4096L)
 In org.apache.spark.network.util.TransportConf:
  public int memoryMapBytes() {
    return conf.getInt("spark.storage.memoryMapThreshold", 2 * 1024 * 1024);
  }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4607) Add random seed to GradientBoostedTrees

2014-12-06 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237029#comment-14237029
 ] 

Kai Sasaki commented on SPARK-4607:
---

[~josephkb] I think each tree in the iterations of GradientBoostedTrees is always 
trained on all the training data. Is there any case where we have to do subsampling 
when building the RandomForest? The current GradientBoostedTrees code uses a 
non-subsampling RandomForest. 

 Add random seed to GradientBoostedTrees
 ---

 Key: SPARK-4607
 URL: https://issues.apache.org/jira/browse/SPARK-4607
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 Gradient Boosted Trees does not take a random seed, but it uses randomness if 
 the subsampling rate is < 1.  It should take a random seed parameter.
 This update will also help to make unit tests more stable by allowing 
 determinism (using a small set of fixed random seeds).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4607) Add random seed to GradientBoostedTrees

2014-12-06 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237029#comment-14237029
 ] 

Kai Sasaki edited comment on SPARK-4607 at 12/7/14 2:34 AM:


[~josephkb] I think each tree in the iterations of GradientBoostedTrees is always 
trained on all the training data. Is there any case where we have to do subsampling 
when building the RandomForest? The current GradientBoostedTrees code uses a 
non-subsampling RandomForest. 


was (Author: lewuathe):
[~josephkb] I think each trees in iterations of GrandientBoostedTrees is always 
trained all training data. Is there any case when we have to do subsampling 
with making RandomForest? Current GrandientBoostedTrees code uses non 
subsampling RandomForest. 

 Add random seed to GradientBoostedTrees
 ---

 Key: SPARK-4607
 URL: https://issues.apache.org/jira/browse/SPARK-4607
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 Gradient Boosted Trees does not take a random seed, but it uses randomness if 
 the subsampling rate is < 1.  It should take a random seed parameter.
 This update will also help to make unit tests more stable by allowing 
 determinism (using a small set of fixed random seeds).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4652) Add docs about spark-git-repo option

2014-11-29 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-4652:
-

 Summary: Add docs about spark-git-repo option
 Key: SPARK-4652
 URL: https://issues.apache.org/jira/browse/SPARK-4652
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.1.0
Reporter: Kai Sasaki
Priority: Minor


It was a little hard to understand how to use the --spark-git-repo option of the 
spark-ec2 script. Some additional documentation might be needed to explain how to use it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4656) Typo in Programming Guide markdown

2014-11-29 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-4656:
-

 Summary: Typo in Programming Guide markdown
 Key: SPARK-4656
 URL: https://issues.apache.org/jira/browse/SPARK-4656
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.1.0
Reporter: Kai Sasaki
Priority: Trivial


Grammatical error in Programming Guide document



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4656) Typo in Programming Guide markdown

2014-11-29 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228998#comment-14228998
 ] 

Kai Sasaki commented on SPARK-4656:
---

Created the patch. Please review it.
https://github.com/apache/spark/pull/3412

 Typo in Programming Guide markdown
 --

 Key: SPARK-4656
 URL: https://issues.apache.org/jira/browse/SPARK-4656
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.1.0
Reporter: Kai Sasaki
Priority: Trivial

 Grammatical error in Programming Guide document



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2014-11-25 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224467#comment-14224467
 ] 

Kai Sasaki commented on SPARK-4036:
---

Hi, I want to work on this ticket. I'll write a base implementation of CRF 
running on MLlib. Could you please assign this ticket to me? 

 Add Conditional Random Fields (CRF) algorithm to Spark MLlib
 

 Key: SPARK-4036
 URL: https://issues.apache.org/jira/browse/SPARK-4036
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Guoqiang Li

 Conditional random fields (CRFs) are a class of statistical modelling method 
 often applied in pattern recognition and machine learning, where they are 
 used for structured prediction. 
 The paper: 
 http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4490) Not found RandomGenerator through spark-shell

2014-11-23 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222382#comment-14222382
 ] 

Kai Sasaki commented on SPARK-4490:
---

[~srowen] Yes, exactly. So one option is to remove the test-scope dependency declared 
in pom.xml so that commons-math3 and breeze are always added to the classpath even 
when the Spark source is built without tests. Previously I checked the 1.1.0 release 
source code distributed on the Apache site (https://spark.apache.org/downloads.html).
On the current master HEAD, this problem has already been solved (the test scope 
dependency was removed), so I think this problem is fixed. Thank you.

 Not found RandomGenerator through spark-shell
 -

 Key: SPARK-4490
 URL: https://issues.apache.org/jira/browse/SPARK-4490
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: spark-shell
Reporter: Kai Sasaki

 In spark-1.1.0, an exception is thrown whenever the RandomGenerator of commons-math3 
 is used. There is a workaround for this problem.
 http://find.searchhub.org/document/6df0c89201dfe386#10bd443c4da849a3
 ```
 scala> import breeze.linalg._
 import breeze.linalg._
 scala> Matrix.rand[Double](3, 3)
 java.lang.NoClassDefFoundError: 
 org/apache/commons/math3/random/RandomGenerator
 at 
 breeze.linalg.MatrixConstructors$class.rand$default$3(Matrix.scala:205)
 at breeze.linalg.Matrix$.rand$default$3(Matrix.scala:139)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:14)
 at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:19)
 at $iwC$$iwC$$iwC$$iwC.init(console:21)
 at $iwC$$iwC$$iwC.init(console:23)
 at $iwC$$iwC.init(console:25)
 at $iwC.init(console:27)
 at init(console:29)
 at .init(console:33)
 at .clinit(console)
 at .init(console:7)
 at .clinit(console)
 at $print(console)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
 at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
 at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
 at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
 at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
 at 
 org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
 at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
 at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
 at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
 at org.apache.spark.repl.Main$.main(Main.scala:31)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 aused by: java.lang.ClassNotFoundException: 
 org.apache.commons.math3.random.RandomGenerator
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at 

[jira] [Commented] (SPARK-4288) Add Sparse Autoencoder algorithm to MLlib

2014-11-22 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221895#comment-14221895
 ] 

Kai Sasaki commented on SPARK-4288:
---

[~mengxr] Thank you. I'll join. 

 Add Sparse Autoencoder algorithm to MLlib 
 --

 Key: SPARK-4288
 URL: https://issues.apache.org/jira/browse/SPARK-4288
 Project: Spark
  Issue Type: Wish
  Components: MLlib
Reporter: Guoqiang Li
  Labels: features

 Are you proposing an implementation? Is it related to the neural network JIRA?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4490) Not found RandomGenerator through spark-shell

2014-11-21 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221779#comment-14221779
 ] 

Kai Sasaki commented on SPARK-4490:
---

Sean, thank you for replying.
I checked the HEAD of the current master branch and this scope dependency has 
already been removed, so the problem won't occur in future versions.
https://github.com/apache/spark/blob/master/pom.xml#L378-L382
https://github.com/apache/spark/blob/master/core/pom.xml#L123-L126

 Not found RandomGenerator through spark-shell
 -

 Key: SPARK-4490
 URL: https://issues.apache.org/jira/browse/SPARK-4490
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: spark-shell
Reporter: Kai Sasaki

 In spark-1.1.0, an exception is thrown whenever the RandomGenerator of commons-math3 
 is used. There is a workaround for this problem.
 http://find.searchhub.org/document/6df0c89201dfe386#10bd443c4da849a3
 ```
 scala> import breeze.linalg._
 import breeze.linalg._
 scala> Matrix.rand[Double](3, 3)
 java.lang.NoClassDefFoundError: 
 org/apache/commons/math3/random/RandomGenerator
 at 
 breeze.linalg.MatrixConstructors$class.rand$default$3(Matrix.scala:205)
 at breeze.linalg.Matrix$.rand$default$3(Matrix.scala:139)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:14)
 at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:19)
 at $iwC$$iwC$$iwC$$iwC.init(console:21)
 at $iwC$$iwC$$iwC.init(console:23)
 at $iwC$$iwC.init(console:25)
 at $iwC.init(console:27)
 at init(console:29)
 at .init(console:33)
 at .clinit(console)
 at .init(console:7)
 at .clinit(console)
 at $print(console)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
 at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
 at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
 at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
 at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
 at 
 org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
 at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
 at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
 at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
 at org.apache.spark.repl.Main$.main(Main.scala:31)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.commons.math3.random.RandomGenerator
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 44 more
 ```



--
This message was sent by Atlassian 

[jira] [Commented] (SPARK-4490) Not found RandomGenerator through spark-shell

2014-11-20 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219277#comment-14219277
 ] 

Kai Sasaki commented on SPARK-4490:
---

I found the reason why this error occurs. The dependency scope of 
commons-math3 is defined as test in pom.xml:

```
  <dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-math3</artifactId>
    <version>3.3</version>
    <scope>test</scope>
  </dependency>
```
I built Spark with -DskipTests, so commons-math3 was not included in my 
classpath. But is it really necessary to define this scope as test? I think it 
would be more useful to remove the scope definition, as is done for most other 
dependencies, since there are cases where Spark is built without running the tests.
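As a rough sketch (not tested against this exact build): once commons-math3 is 
resolvable at runtime, either by removing the test scope above and rebuilding, or 
by passing the jar explicitly when starting the shell, the snippet from the report 
should run. The jar path below is only a hypothetical example of a local location.
```
// Hypothetical launch command; the actual jar location depends on the local setup:
//   bin/spark-shell --jars /path/to/commons-math3-3.3.jar
import breeze.linalg._
import org.apache.commons.math3.random.MersenneTwister

// With commons-math3 on the classpath this no longer throws NoClassDefFoundError.
val m = Matrix.rand[Double](3, 3)
val rng = new MersenneTwister(42L)
println(m)
println(rng.nextDouble())
```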

 Not found RandomGenerator through spark-shell
 -

 Key: SPARK-4490
 URL: https://issues.apache.org/jira/browse/SPARK-4490
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: spark-shell
Reporter: Kai Sasaki

 In spark-1.1.0, an exception is thrown whenever RandomGenerator of commons-math3 
 is used. There is a workaround for this problem:
 http://find.searchhub.org/document/6df0c89201dfe386#10bd443c4da849a3
 ```
 scala> import breeze.linalg._
 import breeze.linalg._
 scala> Matrix.rand[Double](3, 3)
 java.lang.NoClassDefFoundError: 
 org/apache/commons/math3/random/RandomGenerator
 at 
 breeze.linalg.MatrixConstructors$class.rand$default$3(Matrix.scala:205)
 at breeze.linalg.Matrix$.rand$default$3(Matrix.scala:139)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:14)
 at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:19)
 at $iwC$$iwC$$iwC$$iwC.init(console:21)
 at $iwC$$iwC$$iwC.init(console:23)
 at $iwC$$iwC.init(console:25)
 at $iwC.init(console:27)
 at init(console:29)
 at .init(console:33)
 at .clinit(console)
 at .init(console:7)
 at .clinit(console)
 at $print(console)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
 at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
 at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
 at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
 at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
 at 
 org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
 at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
 at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
 at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
 at org.apache.spark.repl.Main$.main(Main.scala:31)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.commons.math3.random.RandomGenerator
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 

[jira] [Created] (SPARK-4490) Not found RandomGenerator through spark-shell

2014-11-19 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-4490:
-

 Summary: Not found RandomGenerator through spark-shell
 Key: SPARK-4490
 URL: https://issues.apache.org/jira/browse/SPARK-4490
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: spark-shell
Reporter: Kai Sasaki
Priority: Critical


In spark-1.1.0, an exception is thrown whenever RandomGenerator of commons-math3 is 
used. There is a workaround for this problem:
http://find.searchhub.org/document/6df0c89201dfe386#10bd443c4da849a3


```
scala> import breeze.linalg._
import breeze.linalg._

scala> Matrix.rand[Double](3, 3)
java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator
at 
breeze.linalg.MatrixConstructors$class.rand$default$3(Matrix.scala:205)
at breeze.linalg.Matrix$.rand$default$3(Matrix.scala:139)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:14)
at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:19)
at $iwC$$iwC$$iwC$$iwC.init(console:21)
at $iwC$$iwC$$iwC.init(console:23)
at $iwC$$iwC.init(console:25)
at $iwC.init(console:27)
at init(console:29)
at .init(console:33)
at .clinit(console)
at .init(console:7)
at .clinit(console)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: 
org.apache.commons.math3.random.RandomGenerator
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 44 more
```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4490) Not found RandomGenerator through spark-shell

2014-11-19 Thread Kai Sasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Sasaki updated SPARK-4490:
--
Priority: Major  (was: Critical)

 Not found RandomGenerator through spark-shell
 -

 Key: SPARK-4490
 URL: https://issues.apache.org/jira/browse/SPARK-4490
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: spark-shell
Reporter: Kai Sasaki

 In spark-1.1.0, an exception is thrown whenever RandomGenerator of commons-math3 
 is used. There is a workaround for this problem:
 http://find.searchhub.org/document/6df0c89201dfe386#10bd443c4da849a3
 ```
 scala> import breeze.linalg._
 import breeze.linalg._
 scala> Matrix.rand[Double](3, 3)
 java.lang.NoClassDefFoundError: 
 org/apache/commons/math3/random/RandomGenerator
 at 
 breeze.linalg.MatrixConstructors$class.rand$default$3(Matrix.scala:205)
 at breeze.linalg.Matrix$.rand$default$3(Matrix.scala:139)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:14)
 at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:19)
 at $iwC$$iwC$$iwC$$iwC.init(console:21)
 at $iwC$$iwC$$iwC.init(console:23)
 at $iwC$$iwC.init(console:25)
 at $iwC.init(console:27)
 at init(console:29)
 at .init(console:33)
 at .clinit(console)
 at .init(console:7)
 at .clinit(console)
 at $print(console)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
 at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
 at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
 at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
 at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
 at 
 org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
 at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
 at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
 at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
 at org.apache.spark.repl.Main$.main(Main.scala:31)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.commons.math3.random.RandomGenerator
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 44 more
 ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4490) Not found RandomGenerator through spark-shell

2014-11-19 Thread Kai Sasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Sasaki updated SPARK-4490:
--
Description: 
In spark-1.1.0, an exception is thrown whenever RandomGenerator of commons-math3 is 
used. There is a workaround for this problem:
http://find.searchhub.org/document/6df0c89201dfe386#10bd443c4da849a3


```
scala> import breeze.linalg._
import breeze.linalg._

scala> Matrix.rand[Double](3, 3)
java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator
at 
breeze.linalg.MatrixConstructors$class.rand$default$3(Matrix.scala:205)
at breeze.linalg.Matrix$.rand$default$3(Matrix.scala:139)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:14)
at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:19)
at $iwC$$iwC$$iwC$$iwC.init(console:21)
at $iwC$$iwC$$iwC.init(console:23)
at $iwC$$iwC.init(console:25)
at $iwC.init(console:27)
at init(console:29)
at .init(console:33)
at .clinit(console)
at .init(console:7)
at .clinit(console)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: 
org.apache.commons.math3.random.RandomGenerator
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 44 more
```

  was:
In spark-1.1.0, exception is threw whenever RandomGenerator of commons-math3 is 
used. There is some workaround about this problem.
http://find.searchhub.org/document/6df0c89201dfe386#10bd443c4da849a3


```
cala import breeze.linalg._
import breeze.linalg._

scala Matrix.rand[Double](3, 3)
java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator
at 
breeze.linalg.MatrixConstructors$class.rand$default$3(Matrix.scala:205)
at breeze.linalg.Matrix$.rand$default$3(Matrix.scala:139)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:14)
at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:19)
at $iwC$$iwC$$iwC$$iwC.init(console:21)
at $iwC$$iwC$$iwC.init(console:23)
at $iwC$$iwC.init(console:25)
at $iwC.init(console:27)
at init(console:29)
at 

[jira] [Commented] (SPARK-4288) Add Sparse Autoencoder algorithm to MLlib

2014-11-10 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14204538#comment-14204538
 ] 

Kai Sasaki commented on SPARK-4288:
---

Can I take charge of this ticket? 
I have a starter implementation of an autoencoder and a deep neural network 
written in Scala. I'll port it so that it can run on the Spark platform.

https://github.com/Lewuathe/42

Thank you.
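
For reference, here is a minimal sketch (plain Scala, unrelated to the repository 
linked above) of the piece that makes an autoencoder "sparse": a KL-divergence 
penalty that pushes each hidden unit's mean activation toward a small target value 
rho, added to the usual reconstruction cost.
```
// Illustration only: the KL-divergence sparsity penalty of a sparse autoencoder.
// rho is the target mean activation, rhoHat the observed mean activation of a unit.
object SparsityPenalty {
  private def kl(rho: Double, rhoHat: Double): Double =
    rho * math.log(rho / rhoHat) + (1 - rho) * math.log((1 - rho) / (1 - rhoHat))

  // beta * sum_j KL(rho || rhoHat_j), summed over the hidden units.
  def apply(meanActivations: Seq[Double], rho: Double = 0.05, beta: Double = 3.0): Double =
    beta * meanActivations.map(a => kl(rho, a)).sum
}

// Example: units whose mean activation drifts away from rho are penalized more.
// SparsityPenalty(Seq(0.05, 0.2, 0.01))
```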

 Add Sparse Autoencoder algorithm to MLlib 
 --

 Key: SPARK-4288
 URL: https://issues.apache.org/jira/browse/SPARK-4288
 Project: Spark
  Issue Type: Wish
  Components: MLlib
Reporter: Guoqiang Li
  Labels: features

 Are you proposing an implementation? Is it related to the neural network JIRA?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org