[jira] [Created] (SPARK-2222) Add multiclass evaluation metrics

2014-06-20 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-2222:
---

 Summary: Add multiclass evaluation metrics
 Key: SPARK-2222
 URL: https://issues.apache.org/jira/browse/SPARK-2222
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Alexander Ulanov


There is no class in Spark MLlib for measuring the performance of multiclass 
classifiers. This task involves adding such a class and unit tests. The 
following measures are to be implemented: per-class, micro-averaged and 
weighted-averaged Precision, Recall and F1-Measure.
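
For illustration, a minimal sketch of how per-class precision, recall and 
F1-Measure could be computed from an RDD of (prediction, label) pairs; the 
object and method names are placeholders, not the proposed API:

import org.apache.spark.rdd.RDD

// Sketch only: per-class precision/recall/F1 from (prediction, label) pairs.
// A real implementation would aggregate all counts in a single pass.
object MulticlassMetricsSketch {
  def perClass(scoreAndLabels: RDD[(Double, Double)]): Map[Double, (Double, Double, Double)] = {
    val classes = scoreAndLabels.map(_._2).distinct().collect()
    classes.map { c =>
      val tp = scoreAndLabels.filter { case (p, l) => p == c && l == c }.count().toDouble
      val fp = scoreAndLabels.filter { case (p, l) => p == c && l != c }.count().toDouble
      val fn = scoreAndLabels.filter { case (p, l) => p != c && l == c }.count().toDouble
      val precision = if (tp + fp == 0) 0.0 else tp / (tp + fp)
      val recall = if (tp + fn == 0) 0.0 else tp / (tp + fn)
      val f1 = if (precision + recall == 0) 0.0 else 2 * precision * recall / (precision + recall)
      c -> (precision, recall, f1)
    }.toMap
  }
}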





[jira] [Created] (SPARK-2329) Add multi-label evaluation metrics

2014-06-30 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-2329:
---

 Summary: Add multi-label evaluation metrics
 Key: SPARK-2329
 URL: https://issues.apache.org/jira/browse/SPARK-2329
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Alexander Ulanov
 Fix For: 1.1.0


There is no class in Spark MLlib for measuring the performance of multi-label 
classifiers. In multilabel classification, each document is labeled with 
several labels (classes).

This task involves adding a class for multilabel evaluation and unit tests. 
The following measures are to be implemented: Precision, Recall and F1-measure 
(1) document-based, averaged over the number of documents; (2) per label; (3) 
label-based, micro- and macro-averaged; (4) Hamming loss. Reference: 
Tsoumakas, Grigorios, Ioannis Katakis, and Ioannis Vlahavas. Mining 
multi-label data. *Data Mining and Knowledge Discovery Handbook*. Springer US, 
2010. 667-685.
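
For illustration, a hedged sketch of measures (1) and (4) over pairs of 
predicted and true label sets per document; the names are placeholders, not 
the proposed API:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Sketch only. Each record holds the predicted and the true label set of one document.
object MultilabelMetricsSketch {
  // Measure (1): precision averaged over documents.
  def docAveragedPrecision(data: RDD[(Set[Double], Set[Double])]): Double =
    data.map { case (predicted, actual) =>
      if (predicted.isEmpty) 0.0
      else (predicted intersect actual).size.toDouble / predicted.size
    }.mean()

  // Measure (4): Hamming loss, the fraction of misassigned labels.
  def hammingLoss(data: RDD[(Set[Double], Set[Double])], numLabels: Int): Double =
    data.map { case (predicted, actual) =>
      ((predicted diff actual).size + (actual diff predicted).size).toDouble / numLabels
    }.mean()
}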





[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets

2014-07-02 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049939#comment-14049939
 ] 

Alexander Ulanov commented on SPARK-1473:
-

Is anybody working on this issue?

 Feature selection for high dimensional datasets
 ---

 Key: SPARK-1473
 URL: https://issues.apache.org/jira/browse/SPARK-1473
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Ignacio Zendejas
Priority: Minor
  Labels: features
 Fix For: 1.1.0


 For classification tasks involving large feature spaces on the order of tens 
 of thousands or higher (e.g., text classification with n-grams, where n > 1), 
 it is often useful to rank and filter out irrelevant features, thereby 
 reducing the feature space by at least one or two orders of magnitude without 
 impacting performance on key evaluation metrics (accuracy/precision/recall).
 A flexible feature evaluation interface needs to be designed, and at 
 least two methods should be implemented, with Information Gain being a 
 priority as it has been shown to be amongst the most reliable.
 Special consideration should be taken in the design to account for wrapper 
 methods (see research papers below), which are more practical for 
 lower-dimensional data.
 Relevant research:
 * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional 
 likelihood maximisation: a unifying framework for information theoretic 
 feature selection. *The Journal of Machine Learning Research*, *13*, 27-66.
 * Forman, George. An extensive empirical study of feature selection metrics 
 for text classification. *The Journal of Machine Learning Research* 3 (2003): 
 1289-1305.





[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets

2014-08-08 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090473#comment-14090473
 ] 

Alexander Ulanov commented on SPARK-1473:
-

I've implemented Chi-Squared and added a pull request
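
For context, the chi-squared score of a feature reduces to a statistic over 
the observed feature/class contingency counts; below is a minimal 
self-contained sketch of that statistic, not the code from the pull request:

// Sketch only: chi-squared statistic for a contingency table of observed
// counts (rows: feature present/absent, columns: classes).
def chiSquared(observed: Array[Array[Double]]): Double = {
  val total = observed.map(_.sum).sum
  val rowSums = observed.map(_.sum)
  val colSums = observed.transpose.map(_.sum)
  (for {
    i <- observed.indices
    j <- observed(i).indices
  } yield {
    val expected = rowSums(i) * colSums(j) / total
    val d = observed(i)(j) - expected
    d * d / expected
  }).sum
}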

 Feature selection for high dimensional datasets
 ---

 Key: SPARK-1473
 URL: https://issues.apache.org/jira/browse/SPARK-1473
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Ignacio Zendejas
Priority: Minor
  Labels: features
 Fix For: 1.1.0


 For classification tasks involving large feature spaces on the order of tens 
 of thousands or higher (e.g., text classification with n-grams, where n > 1), 
 it is often useful to rank and filter out irrelevant features, thereby 
 reducing the feature space by at least one or two orders of magnitude without 
 impacting performance on key evaluation metrics (accuracy/precision/recall).
 A flexible feature evaluation interface needs to be designed, and at 
 least two methods should be implemented, with Information Gain being a 
 priority as it has been shown to be amongst the most reliable.
 Special consideration should be taken in the design to account for wrapper 
 methods (see research papers below), which are more practical for 
 lower-dimensional data.
 Relevant research:
 * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional 
 likelihood maximisation: a unifying framework for information theoretic 
 feature selection. *The Journal of Machine Learning Research*, *13*, 27-66.
 * Forman, George. An extensive empirical study of feature selection metrics 
 for text classification. *The Journal of Machine Learning Research* 3 (2003): 
 1289-1305.






[jira] [Comment Edited] (SPARK-1473) Feature selection for high dimensional datasets

2014-08-08 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090473#comment-14090473
 ] 

Alexander Ulanov edited comment on SPARK-1473 at 8/8/14 8:27 AM:
-

I've implemented Chi-Squared and added a pull request 
https://github.com/apache/spark/pull/1484


was (Author: avulanov):
I've implemented Chi-Squared and added a pull request

 Feature selection for high dimensional datasets
 ---

 Key: SPARK-1473
 URL: https://issues.apache.org/jira/browse/SPARK-1473
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Ignacio Zendejas
Priority: Minor
  Labels: features
 Fix For: 1.1.0


 For classification tasks involving large feature spaces on the order of tens 
 of thousands or higher (e.g., text classification with n-grams, where n > 1), 
 it is often useful to rank and filter out irrelevant features, thereby 
 reducing the feature space by at least one or two orders of magnitude without 
 impacting performance on key evaluation metrics (accuracy/precision/recall).
 A flexible feature evaluation interface needs to be designed, and at 
 least two methods should be implemented, with Information Gain being a 
 priority as it has been shown to be amongst the most reliable.
 Special consideration should be taken in the design to account for wrapper 
 methods (see research papers below), which are more practical for 
 lower-dimensional data.
 Relevant research:
 * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional 
 likelihood maximisation: a unifying framework for information theoretic 
 feature selection. *The Journal of Machine Learning Research*, *13*, 27-66.
 * Forman, George. An extensive empirical study of feature selection metrics 
 for text classification. *The Journal of Machine Learning Research* 3 (2003): 
 1289-1305.






[jira] [Created] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-04 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-3403:
---

 Summary: NaiveBayes crashes with blas/lapack native libraries for 
breeze (netlib-java)
 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
described on https://github.com/fommil/netlib-java). I used OpenBLAS x64 and 
MinGW64 precompiled DLLs.
Reporter: Alexander Ulanov
 Fix For: 1.1.0


Code (run in spark-shell; train and test are assumed to be RDD[LabeledPoint]):

import org.apache.spark.mllib.classification.NaiveBayes

val model = NaiveBayes.train(train)
val predictionAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
predictionAndLabels.foreach(println)

Result: 
program crashes with "Process finished with exit code -1073741819 
(0xC0000005)" after displaying the first prediction






[jira] [Updated] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-04 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-3403:

Attachment: NativeNN.scala

The attached file contains an example that reproduces the same issue.

 NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
 -

 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
 described on https://github.com/fommil/netlib-java). I used OpenBLAS x64 and 
 MinGW64 precompiled DLLs.
Reporter: Alexander Ulanov
 Fix For: 1.1.0

 Attachments: NativeNN.scala


 Code:
 val model = NaiveBayes.train(train)
 val predictionAndLabels = test.map { point =>
   val score = model.predict(point.features)
   (score, point.label)
 }
 predictionAndLabels.foreach(println)
 Result: 
 program crashes with: Process finished with exit code -1073741819 
 (0xC0000005) after displaying the first prediction






[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-04 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121563#comment-14121563
 ] 

Alexander Ulanov commented on SPARK-3403:
-

Yes, I tried using netlib-java separately with the same OpenBLAS setup and it 
worked properly, even within several threads. However, I didn't mimic the same 
multi-threading setup as MLlib has because it is complicated. Do you want me 
to send you all the DLLs that I used? I had trouble compiling OpenBLAS for 
Windows, so I used precompiled x64 versions from the OpenBLAS and MinGW64 websites.


 NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
 -

 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
 described on https://github.com/fommil/netlib-java). I used OpenBLAS x64 and 
 MinGW64 precompiled DLLs.
Reporter: Alexander Ulanov
 Fix For: 1.1.0

 Attachments: NativeNN.scala


 Code:
 val model = NaiveBayes.train(train)
 val predictionAndLabels = test.map { point =>
   val score = model.predict(point.features)
   (score, point.label)
 }
 predictionAndLabels.foreach(println)
 Result: 
 program crashes with: Process finished with exit code -1073741819 
 (0xC0000005) after displaying the first prediction






[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-05 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122699#comment-14122699
 ] 

Alexander Ulanov commented on SPARK-3403:
-

I managed to compile OpenBLAS with MinGW64 and `USE_THREAD=0` and got a 
single-threaded DLL. With this DLL my tests didn't fail and seem to execute 
properly. Thank you for the suggestion! 
1) Do you think that the same issue will remain on Linux?
2) What are the performance implications of using single-threaded OpenBLAS 
through breeze?


 NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
 -

 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
 described on https://github.com/fommil/netlib-java). I used OpenBLAS x64 and 
 MinGW64 precompiled DLLs.
Reporter: Alexander Ulanov
 Fix For: 1.1.0

 Attachments: NativeNN.scala


 Code:
 val model = NaiveBayes.train(train)
 val predictionAndLabels = test.map { point =>
   val score = model.predict(point.features)
   (score, point.label)
 }
 predictionAndLabels.foreach(println)
 Result: 
 program crashes with: Process finished with exit code -1073741819 
 (0xC0000005) after displaying the first prediction






[jira] [Comment Edited] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-05 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122699#comment-14122699
 ] 

Alexander Ulanov edited comment on SPARK-3403 at 9/5/14 9:53 AM:
-

I managed to compile OpenBLAS with MinGW64 and `USE_THREAD=0` and got a 
single-threaded DLL. With this DLL my tests didn't fail and seem to execute 
properly. Thank you for the suggestion! 
1) Do you think that the same issue will remain on Linux?
2) What are the performance implications of using single-threaded OpenBLAS 
through breeze?
3) I didn't get any performance improvement with native libraries versus Java 
arrays. My matrices have sizes up to 10K-20K. Is that expected?


was (Author: avulanov):
I managed to compile OpenBLAS with MinGW64 and `USE_THREAD=0` and got a 
single-threaded DLL. With this DLL my tests didn't fail and seem to execute 
properly. Thank you for the suggestion! 
1) Do you think that the same issue will remain on Linux?
2) What are the performance implications of using single-threaded OpenBLAS 
through breeze?


 NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
 -

 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
 described on https://github.com/fommil/netlib-java). I used OpenBLAS x64 and 
 MinGW64 precompiled DLLs.
Reporter: Alexander Ulanov
 Fix For: 1.1.0

 Attachments: NativeNN.scala


 Code:
 val model = NaiveBayes.train(train)
 val predictionAndLabels = test.map { point =>
   val score = model.predict(point.features)
   (score, point.label)
 }
 predictionAndLabels.foreach(println)
 Result: 
 program crashes with: Process finished with exit code -1073741819 
 (0xC0000005) after displaying the first prediction






[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-18 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138829#comment-14138829
 ] 

Alexander Ulanov commented on SPARK-3403:
-

Thank you, your answers are really helpful. Should I submit this issue to 
OpenBLAS (https://github.com/xianyi/OpenBLAS) or netlib-java 
(https://github.com/fommil/netlib-java)? I thought the latter has the JNI 
implementation. Is it OK to submit it as is?

 NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
 -

 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
 described on https://github.com/fommil/netlib-java). I used OpenBLAS x64 and 
 MinGW64 precompiled DLLs.
Reporter: Alexander Ulanov
 Fix For: 1.2.0

 Attachments: NativeNN.scala


 Code:
 val model = NaiveBayes.train(train)
 val predictionAndLabels = test.map { point =>
   val score = model.predict(point.features)
   (score, point.label)
 }
 predictionAndLabels.foreach(println)
 Result: 
 program crashes with: Process finished with exit code -1073741819 
 (0xC0000005) after displaying the first prediction






[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-19 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140128#comment-14140128
 ] 

Alexander Ulanov commented on SPARK-3403:
-

Posted to netlib-java: https://github.com/fommil/netlib-java/issues/72

 NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
 -

 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
 described on https://github.com/fommil/netlib-java). I used OpenBLAS x64 and 
 MinGW64 precompiled DLLs.
Reporter: Alexander Ulanov
 Fix For: 1.2.0

 Attachments: NativeNN.scala


 Code:
 val model = NaiveBayes.train(train)
 val predictionAndLabels = test.map { point =>
   val score = model.predict(point.features)
   (score, point.label)
 }
 predictionAndLabels.foreach(println)
 Result: 
 program crashes with: Process finished with exit code -1073741819 
 (0xC0000005) after displaying the first prediction






[jira] [Comment Edited] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-19 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138829#comment-14138829
 ] 

Alexander Ulanov edited comment on SPARK-3403 at 9/19/14 7:16 AM:
--

Thank you, your answers are really helpful. Should I submit this issue to 
OpenBLAS ( https://github.com/xianyi/OpenBLAS ) or netlib-java ( 
https://github.com/fommil/netlib-java )? I thought the latter has the JNI 
implementation. Is it OK to submit it as is?


was (Author: avulanov):
Thank you, your answers are really helpful. Should I submit this issue to 
OpenBLAS (https://github.com/xianyi/OpenBLAS) or netlib-java 
(https://github.com/fommil/netlib-java)? I thought the latter has the JNI 
implementation. Is it OK to submit it as is?

 NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
 -

 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
 described on https://github.com/fommil/netlib-java). I used OpenBLAS x64 and 
 MinGW64 precompiled DLLs.
Reporter: Alexander Ulanov
 Fix For: 1.2.0

 Attachments: NativeNN.scala


 Code:
 val model = NaiveBayes.train(train)
 val predictionAndLabels = test.map { point =>
   val score = model.predict(point.features)
   (score, point.label)
 }
 predictionAndLabels.foreach(println)
 Result: 
 program crashes with: Process finished with exit code -1073741819 
 (0xC0000005) after displaying the first prediction






[jira] [Created] (SPARK-4752) Classifier based on artificial neural network

2014-12-04 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-4752:
---

 Summary: Classifier based on artificial neural network
 Key: SPARK-4752
 URL: https://issues.apache.org/jira/browse/SPARK-4752
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Alexander Ulanov
 Fix For: 1.3.0


Implement a classifier based on an artificial neural network (ANN). Requirements:
1) Use the existing artificial neural network implementation: 
https://issues.apache.org/jira/browse/SPARK-2352, 
https://github.com/apache/spark/pull/1290
2) Extend the MLlib ClassificationModel trait,
3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training,
4) Be able to return the ANN model
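
A rough sketch of what such a model could look like; the class name is a 
placeholder, and `forward` stands in for the trained network from the 
implementation referenced in requirement 1:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Placeholder sketch mirroring the ClassificationModel contract: the class
// label is encoded as a binary output vector, and prediction returns the
// index of the largest network output.
class ANNClassifierModelSketch(forward: Vector => Array[Double]) extends Serializable {
  def predict(testData: Vector): Double =
    forward(testData).zipWithIndex.maxBy(_._1)._2.toDouble
  def predict(testData: RDD[Vector]): RDD[Double] = testData.map(v => predict(v))
}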






[jira] [Comment Edited] (SPARK-4752) Classifier based on artificial neural network

2014-12-04 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234855#comment-14234855
 ] 

Alexander Ulanov edited comment on SPARK-4752 at 12/5/14 12:51 AM:
---

The initial implementation can be found here: 
https://github.com/avulanov/spark/tree/annclassifier. It encodes the class 
label as a binary vector in the ANN output and selects the class based on the 
biggest output value. The implementation contains unit tests as well. 

The mentioned code uses the following PR: 
https://github.com/apache/spark/pull/1290. It is not yet merged into the main 
branch. I think that I should not make a pull request until then.


was (Author: avulanov):
The initial implementation can be found here: 
https://github.com/avulanov/spark/tree/annclassifier. It codes the class label 
as a binary vector in the ANN output and selects the class based on biggest 
output value. The implementation contains unit tests as well. 

The mentioned code uses the following PR: 
https://github.com/apache/spark/pull/1290. It is not yet merged into the main 
branch. I think that I should not make a pull request until then.

 Classifier based on artificial neural network
 -

 Key: SPARK-4752
 URL: https://issues.apache.org/jira/browse/SPARK-4752
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Alexander Ulanov
 Fix For: 1.3.0

   Original Estimate: 168h
  Remaining Estimate: 168h

 Implement a classifier based on an artificial neural network (ANN). Requirements:
 1) Use the existing artificial neural network implementation: 
 https://issues.apache.org/jira/browse/SPARK-2352, 
 https://github.com/apache/spark/pull/1290
 2) Extend the MLlib ClassificationModel trait,
 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training,
 4) Be able to return the ANN model






[jira] [Commented] (SPARK-4752) Classifier based on artificial neural network

2014-12-04 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234855#comment-14234855
 ] 

Alexander Ulanov commented on SPARK-4752:
-

The initial implementation can be found here: 
https://github.com/avulanov/spark/tree/annclassifier. It codes the class label 
as a binary vector in the ANN output and selects the class based on biggest 
output value. The implementation contains unit tests as well. 

The mentioned code uses the following PR: 
https://github.com/apache/spark/pull/1290. It is not yet merged into the main 
branch. I think that I should not make a pull request until then.

 Classifier based on artificial neural network
 -

 Key: SPARK-4752
 URL: https://issues.apache.org/jira/browse/SPARK-4752
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Alexander Ulanov
 Fix For: 1.3.0

   Original Estimate: 168h
  Remaining Estimate: 168h

 Implement a classifier based on an artificial neural network (ANN). Requirements:
 1) Use the existing artificial neural network implementation: 
 https://issues.apache.org/jira/browse/SPARK-2352, 
 https://github.com/apache/spark/pull/1290
 2) Extend the MLlib ClassificationModel trait,
 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training,
 4) Be able to return the ANN model






[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289677#comment-14289677
 ] 

Alexander Ulanov commented on SPARK-5386:
-

My spark-env.sh contains:
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_INSTANCES=2
I run spark-shell with ./spark-shell --executor-memory 8G --driver-memory 8G. 
In the Spark UI, each worker has 8GB of memory. 

Btw, I ran this code once again, and this time it does not crash but keeps 
trying to schedule the job on the failing node, which tries to allocate memory 
and fails, and so on. Is that normal behavior?

 Reduce fails with vectors of big length
 ---

 Key: SPARK-5386
 URL: https://issues.apache.org/jira/browse/SPARK-5386
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
 Environment: Overall:
 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
 Spark:
 ./spark-shell --executor-memory 8G --driver-memory 8G
 spark.driver.maxResultSize 0
 java.io.tmpdir and spark.local.dir set to a disk with a lot of free space
Reporter: Alexander Ulanov
 Fix For: 1.3.0


 Code:
 import org.apache.spark.mllib.rdd.RDDFunctions._
 import breeze.linalg._
 import org.apache.log4j._
 Logger.getRootLogger.setLevel(Level.OFF)
 val n = 60000000
 val p = 12
 val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
 vv.reduce(_ + _)
 When executing in the shell, it crashes after some period of time. One of the 
 nodes contains the following in stdout:
 Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
 os::commit_memory(0x0000000755500000, 2863661056, 0) failed; error='Cannot 
 allocate memory' (errno=12)
 #
 # There is insufficient memory for the Java Runtime Environment to continue.
 # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
 committing reserved memory.
 # An error report file with more information is saved as:
 # /datac/spark/app-20150123091936-0000/89/hs_err_pid2247.log
 During the execution there is a message: Job aborted due to stage failure: 
 Exception while getting task result: java.io.IOException: Connection from 
 server-12.net/10.10.10.10:54701 closed






[jira] [Updated] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5386:

Description: 
Code:

import org.apache.spark.mllib.rdd.RDDFunctions._
import breeze.linalg._
import org.apache.log4j._
Logger.getRootLogger.setLevel(Level.OFF)
val n = 60000000
val p = 12
val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
vv.reduce(_ + _)

When executing in the shell, it crashes after some period of time. One of the 
nodes contains the following in stdout:
Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
os::commit_memory(0x0000000755500000, 2863661056, 0) failed; error='Cannot 
allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
committing reserved memory.
# An error report file with more information is saved as:
# /datac/spark/app-20150123091936-0000/89/hs_err_pid2247.log

During the execution there is a message: Job aborted due to stage failure: 
Exception while getting task result: java.io.IOException: Connection from 
server-12.net/10.10.10.10:54701 closed


  was:
Code:

import org.apache.spark.mllib.rdd.RDDFunctions._
import breeze.linalg._
import org.apache.log4j._
Logger.getRootLogger.setLevel(Level.OFF)
val n = 60000000
val p = 12
val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n))
vv.reduce(_ + _)

When executing in the shell, it crashes after some period of time. One of the 
nodes contains the following in stdout:
Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
os::commit_memory(0x0000000755500000, 2863661056, 0) failed; error='Cannot 
allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
committing reserved memory.
# An error report file with more information is saved as:
# /datac/spark/app-20150123091936-0000/89/hs_err_pid2247.log

During the execution there is a message: Job aborted due to stage failure: 
Exception while getting task result: java.io.IOException: Connection from 
server-12.net/10.10.10.10:54701 closed



 Reduce fails with vectors of big length
 ---

 Key: SPARK-5386
 URL: https://issues.apache.org/jira/browse/SPARK-5386
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
 Environment: 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, 
 Ubuntu), each runs 2 Workers
 ./spark-shell --executor-memory 8G --driver-memory 8G
Reporter: Alexander Ulanov
 Fix For: 1.3.0


 Code:
 import org.apache.spark.mllib.rdd.RDDFunctions._
 import breeze.linalg._
 import org.apache.log4j._
 Logger.getRootLogger.setLevel(Level.OFF)
 val n = 60000000
 val p = 12
 val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
 vv.reduce(_ + _)
 When executing in the shell, it crashes after some period of time. One of the 
 nodes contains the following in stdout:
 Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
 os::commit_memory(0x0000000755500000, 2863661056, 0) failed; error='Cannot 
 allocate memory' (errno=12)
 #
 # There is insufficient memory for the Java Runtime Environment to continue.
 # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
 committing reserved memory.
 # An error report file with more information is saved as:
 # /datac/spark/app-20150123091936-0000/89/hs_err_pid2247.log
 During the execution there is a message: Job aborted due to stage failure: 
 Exception while getting task result: java.io.IOException: Connection from 
 server-12.net/10.10.10.10:54701 closed






[jira] [Created] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-5386:
---

 Summary: Reduce fails with vectors of big length
 Key: SPARK-5386
 URL: https://issues.apache.org/jira/browse/SPARK-5386
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
 Environment: 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, 
Ubuntu), each runs 2 Workers
./spark-shell --executor-memory 8G --driver-memory 8G

Reporter: Alexander Ulanov
 Fix For: 1.3.0


Code:

import org.apache.spark.mllib.rdd.RDDFunctions._
import breeze.linalg._
import org.apache.log4j._
Logger.getRootLogger.setLevel(Level.OFF)
val n = 60000000
val p = 12
val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n))
vv.reduce(_ + _)

When executing in the shell, it crashes after some period of time. One of the 
nodes contains the following in stdout:
Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
os::commit_memory(0x0000000755500000, 2863661056, 0) failed; error='Cannot 
allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
committing reserved memory.
# An error report file with more information is saved as:
# /datac/spark/app-20150123091936-0000/89/hs_err_pid2247.log

During the execution there is a message: Job aborted due to stage failure: 
Exception while getting task result: java.io.IOException: Connection from 
server-12.net/10.10.10.10:54701 closed







[jira] [Updated] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5386:

Environment: 
Overall:
6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
Spark:
./spark-shell --executor-memory 8G --driver-memory 8G
spark.driver.maxResultSize 0
java.io.tmpdir and spark.local.dir set to a disk with a lot of free space

  was:
6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
./spark-shell --executor-memory 8G --driver-memory 8G



 Reduce fails with vectors of big length
 ---

 Key: SPARK-5386
 URL: https://issues.apache.org/jira/browse/SPARK-5386
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
 Environment: Overall:
 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
 Spark:
 ./spark-shell --executor-memory 8G --driver-memory 8G
 spark.driver.maxResultSize 0
 java.io.tmpdir and spark.local.dir set to a disk with a lot of free space
Reporter: Alexander Ulanov
 Fix For: 1.3.0


 Code:
 import org.apache.spark.mllib.rdd.RDDFunctions._
 import breeze.linalg._
 import org.apache.log4j._
 Logger.getRootLogger.setLevel(Level.OFF)
 val n = 60000000
 val p = 12
 val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
 vv.reduce(_ + _)
 When executing in the shell, it crashes after some period of time. One of the 
 nodes contains the following in stdout:
 Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
 os::commit_memory(0x0000000755500000, 2863661056, 0) failed; error='Cannot 
 allocate memory' (errno=12)
 #
 # There is insufficient memory for the Java Runtime Environment to continue.
 # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
 committing reserved memory.
 # An error report file with more information is saved as:
 # /datac/spark/app-20150123091936-0000/89/hs_err_pid2247.log
 During the execution there is a message: Job aborted due to stage failure: 
 Exception while getting task result: java.io.IOException: Connection from 
 server-12.net/10.10.10.10:54701 closed






[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289621#comment-14289621
 ] 

Alexander Ulanov commented on SPARK-5386:
-

I allocate 8G for the driver and for each worker. Could you suggest why this 
is not enough to handle a reduce operation on vectors of 60M Doubles?

 Reduce fails with vectors of big length
 ---

 Key: SPARK-5386
 URL: https://issues.apache.org/jira/browse/SPARK-5386
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
 Environment: Overall:
 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
 Spark:
 ./spark-shell --executor-memory 8G --driver-memory 8G
 spark.driver.maxResultSize 0
 java.io.tmpdir and spark.local.dir set to a disk with a lot of free space
Reporter: Alexander Ulanov
 Fix For: 1.3.0


 Code:
 import org.apache.spark.mllib.rdd.RDDFunctions._
 import breeze.linalg._
 import org.apache.log4j._
 Logger.getRootLogger.setLevel(Level.OFF)
 val n = 60000000
 val p = 12
 val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
 vv.reduce(_ + _)
 When executing in the shell, it crashes after some period of time. One of the 
 nodes contains the following in stdout:
 Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
 os::commit_memory(0x0000000755500000, 2863661056, 0) failed; error='Cannot 
 allocate memory' (errno=12)
 #
 # There is insufficient memory for the Java Runtime Environment to continue.
 # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
 committing reserved memory.
 # An error report file with more information is saved as:
 # /datac/spark/app-20150123091936-0000/89/hs_err_pid2247.log
 During the execution there is a message: Job aborted due to stage failure: 
 Exception while getting task result: java.io.IOException: Connection from 
 server-12.net/10.10.10.10:54701 closed






[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289708#comment-14289708
 ] 

Alexander Ulanov commented on SPARK-5386:
-

Thank you for the suggestions.
1. count() does work; it returns 12.
2. It failed with p = 2. However, in some of my previous experiments it did not 
fail even for p up to 5 or 7 (in different runs).

 Reduce fails with vectors of big length
 ---

 Key: SPARK-5386
 URL: https://issues.apache.org/jira/browse/SPARK-5386
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
 Environment: Overall:
 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
 Spark:
 ./spark-shell --executor-memory 8G --driver-memory 8G
 spark.driver.maxResultSize 0
 java.io.tmpdir and spark.local.dir set to a disk with a lot of free space
Reporter: Alexander Ulanov
 Fix For: 1.3.0


 Code:
 import org.apache.spark.mllib.rdd.RDDFunctions._
 import breeze.linalg._
 import org.apache.log4j._
 Logger.getRootLogger.setLevel(Level.OFF)
 val n = 60000000
 val p = 12
 val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
 vv.reduce(_ + _)
 When executing in the shell, it crashes after some period of time. One of the 
 nodes contains the following in stdout:
 Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
 os::commit_memory(0x0000000755500000, 2863661056, 0) failed; error='Cannot 
 allocate memory' (errno=12)
 #
 # There is insufficient memory for the Java Runtime Environment to continue.
 # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
 committing reserved memory.
 # An error report file with more information is saved as:
 # /datac/spark/app-20150123091936-0000/89/hs_err_pid2247.log
 During the execution there is a message: Job aborted due to stage failure: 
 Exception while getting task result: java.io.IOException: Connection from 
 server-12.net/10.10.10.10:54701 closed






[jira] [Comment Edited] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289708#comment-14289708
 ] 

Alexander Ulanov edited comment on SPARK-5386 at 1/23/15 6:52 PM:
--

Thank you for the suggestions.
1. count() does work; it returns 12.
2. The full script failed with p = 2. However, in some of my previous 
experiments it did not fail even for p up to 5 or 7 (in different runs).


was (Author: avulanov):
Thank you for the suggestions.
1. count() does work; it returns 12.
2. It failed with p = 2. However, in some of my previous experiments it did not 
fail even for p up to 5 or 7 (in different runs).

 Reduce fails with vectors of big length
 ---

 Key: SPARK-5386
 URL: https://issues.apache.org/jira/browse/SPARK-5386
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
 Environment: Overall:
 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
 Spark:
 ./spark-shell --executor-memory 8G --driver-memory 8G
 spark.driver.maxResultSize 0
 java.io.tmpdir and spark.local.dir set to a disk with a lot of free space
Reporter: Alexander Ulanov
 Fix For: 1.3.0


 Code:
 import org.apache.spark.mllib.rdd.RDDFunctions._
 import breeze.linalg._
 import org.apache.log4j._
 Logger.getRootLogger.setLevel(Level.OFF)
 val n = 60000000
 val p = 12
 val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
 vv.reduce(_ + _)
 When executing in the shell, it crashes after some period of time. One of the 
 nodes contains the following in stdout:
 Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
 os::commit_memory(0x0000000755500000, 2863661056, 0) failed; error='Cannot 
 allocate memory' (errno=12)
 #
 # There is insufficient memory for the Java Runtime Environment to continue.
 # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
 committing reserved memory.
 # An error report file with more information is saved as:
 # /datac/spark/app-20150123091936-0000/89/hs_err_pid2247.log
 During the execution there is a message: Job aborted due to stage failure: 
 Exception while getting task result: java.io.IOException: Connection from 
 server-12.net/10.10.10.10:54701 closed






[jira] [Updated] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-5386:

Description: 
Code:

import org.apache.spark.mllib.rdd.RDDFunctions._
import breeze.linalg._
import org.apache.log4j._
Logger.getRootLogger.setLevel(Level.OFF)
val n = 60000000
val p = 12
val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
vv.count()
vv.reduce(_ + _)

When executing in the shell, it crashes after some period of time. One of the 
nodes contains the following in stdout:
Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
os::commit_memory(0x0000000755500000, 2863661056, 0) failed; error='Cannot 
allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
committing reserved memory.
# An error report file with more information is saved as:
# /datac/spark/app-20150123091936-0000/89/hs_err_pid2247.log

During the execution there is a message: Job aborted due to stage failure: 
Exception while getting task result: java.io.IOException: Connection from 
server-12.net/10.10.10.10:54701 closed


  was:
Code:

import org.apache.spark.mllib.rdd.RDDFunctions._
import breeze.linalg._
import org.apache.log4j._
Logger.getRootLogger.setLevel(Level.OFF)
val n = 60000000
val p = 12
val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
vv.reduce(_ + _)

When executing in the shell, it crashes after some period of time. One of the 
nodes contains the following in stdout:
Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
os::commit_memory(0x0000000755500000, 2863661056, 0) failed; error='Cannot 
allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
committing reserved memory.
# An error report file with more information is saved as:
# /datac/spark/app-20150123091936-0000/89/hs_err_pid2247.log

During the execution there is a message: Job aborted due to stage failure: 
Exception while getting task result: java.io.IOException: Connection from 
server-12.net/10.10.10.10:54701 closed



 Reduce fails with vectors of big length
 ---

 Key: SPARK-5386
 URL: https://issues.apache.org/jira/browse/SPARK-5386
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
 Environment: Overall:
 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
 Spark:
 ./spark-shell --executor-memory 8G --driver-memory 8G
 spark.driver.maxResultSize 0
 java.io.tmpdir and spark.local.dir set to a disk with a lot of free space
Reporter: Alexander Ulanov
 Fix For: 1.3.0


 Code:
 import org.apache.spark.mllib.rdd.RDDFunctions._
 import breeze.linalg._
 import org.apache.log4j._
 Logger.getRootLogger.setLevel(Level.OFF)
 val n = 60000000
 val p = 12
 val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
 vv.count()
 vv.reduce(_ + _)
 When executing in the shell, it crashes after some period of time. One of the 
 nodes contains the following in stdout:
 Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
 os::commit_memory(0x0000000755500000, 2863661056, 0) failed; error='Cannot 
 allocate memory' (errno=12)
 #
 # There is insufficient memory for the Java Runtime Environment to continue.
 # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
 committing reserved memory.
 # An error report file with more information is saved as:
 # /datac/spark/app-20150123091936-0000/89/hs_err_pid2247.log
 During the execution there is a message: Job aborted due to stage failure: 
 Exception while getting task result: java.io.IOException: Connection from 
 server-12.net/10.10.10.10:54701 closed






[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289880#comment-14289880
 ] 

Alexander Ulanov commented on SPARK-5386:
-

Thank you, that might be the problem. I tried running GC before each operation, 
but it did not help. Probably it takes a lot of memory to initialize a Breeze 
DenseVector. Assuming that the problem is due to insufficient memory on the 
Worker node, I am curious what will happen on the Driver: will it receive 12 
vectors of 60M Doubles each and then do the aggregation? Is that feasible? 
(P.S. I know that there is a treeReduce function that forces partial 
aggregation on the Workers. However, for a big number of Workers the problem 
will remain in treeReduce as well, as far as I understand.) 
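
For reference, a sketch of the treeReduce variant mentioned in the P.S., 
assuming the same spark-shell session as the repro above (treeReduce comes 
from the same mllib RDDFunctions import; depth adds levels of partial 
aggregation on the executors):

import org.apache.spark.mllib.rdd.RDDFunctions._
import breeze.linalg._

// Same job as the repro, but reduced tree-wise with one intermediate
// aggregation level instead of sending all 12 partition results to the driver.
val vv = sc.parallelize(0 until 12, 12).map(i => DenseVector.rand[Double](60000000))
val result = vv.treeReduce(_ + _, depth = 2)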

 Reduce fails with vectors of big length
 ---

 Key: SPARK-5386
 URL: https://issues.apache.org/jira/browse/SPARK-5386
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
 Environment: Overall:
 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
 Spark:
 ./spark-shell --executor-memory 8G --driver-memory 8G
 spark.driver.maxResultSize 0
 java.io.tmpdir and spark.local.dir set to a disk with a lot of free space
Reporter: Alexander Ulanov
 Fix For: 1.3.0


 Code:
 import org.apache.spark.mllib.rdd.RDDFunctions._
 import breeze.linalg._
 import org.apache.log4j._
 Logger.getRootLogger.setLevel(Level.OFF)
 val n = 60000000
 val p = 12
 val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double]( n ))
 vv.count()
 vv.reduce(_ + _)
 When executing in the shell, it crashes after some period of time. One of the 
 nodes contains the following in stdout:
 Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
 os::commit_memory(0x0000000755500000, 2863661056, 0) failed; error='Cannot 
 allocate memory' (errno=12)
 #
 # There is insufficient memory for the Java Runtime Environment to continue.
 # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
 committing reserved memory.
 # An error report file with more information is saved as:
 # /datac/spark/app-20150123091936-0000/89/hs_err_pid2247.log
 During the execution there is a message: Job aborted due to stage failure: 
 Exception while getting task result: java.io.IOException: Connection from 
 server-12.net/10.10.10.10:54701 closed






[jira] [Created] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-02-03 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-5575:
---

 Summary: Artificial neural networks for MLlib deep learning
 Key: SPARK-5575
 URL: https://issues.apache.org/jira/browse/SPARK-5575
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Alexander Ulanov


Goal: Implement various types of artificial neural networks

Motivation: deep learning trend

Requirements: 
1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and 
Backpropagation etc. should be implemented as traits or interfaces, so they can 
be easily extended or reused
2) Implement complex abstractions, such as feed-forward and recurrent networks
3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
autoencoders (sparse and denoising), stacked autoencoders, restricted Boltzmann 
machines (RBM), deep belief networks (DBN), etc.
4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
poolers, etc.
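
A rough sketch of the kind of traits requirement 1 calls for; all names and 
signatures below are illustrative only, not a settled design:

import org.apache.spark.mllib.linalg.Vector

// Illustrative only: pluggable abstractions for layers and error functions.
trait LayerSketch extends Serializable {
  def forward(input: Vector): Vector
  def backward(input: Vector, outputDelta: Vector): Vector
}

trait ErrorSketch extends Serializable {
  def loss(output: Vector, target: Vector): Double
  def delta(output: Vector, target: Vector): Vector
}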






[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs

2015-01-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277986#comment-14277986
 ] 

Alexander Ulanov commented on SPARK-5256:
-

I would like to improve the Gradient interface so that it can process something 
more general than `label` (which is relevant only to classifiers, not to other 
machine learning methods) and also allow batch processing. The simplest way of 
doing this is to add another function to the `Gradient` interface:

def compute(data: Vector, output: Vector, weights: Vector, cumGradient: 
Vector): Double

In the `Gradient` trait it should call `compute` with `label`. Of course, one 
needs to make some adjustments to the LBFGS and GradientDescent optimizers, 
replacing label: Double with output: Vector. 

For batch processing, one can stack the data and output points into a long 
vector (matrices are stored this way in breeze) and pass them through the 
proposed interface.
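
A minimal sketch of this proposal on top of MLlib's Gradient class; the 
bridging of the existing label-based compute to a one-element output vector 
is illustrative:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.optimization.Gradient

// Sketch only: a generic-output gradient. The existing label-based compute
// delegates to the new vector-output version.
abstract class VectorOutputGradient extends Gradient {
  def compute(data: Vector, output: Vector, weights: Vector, cumGradient: Vector): Double

  override def compute(data: Vector, label: Double, weights: Vector,
      cumGradient: Vector): Double =
    compute(data, Vectors.dense(label), weights, cumGradient)
}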

 Improving MLlib optimization APIs
 -

 Key: SPARK-5256
 URL: https://issues.apache.org/jira/browse/SPARK-5256
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 *Goal*: Improve APIs for optimization
 *Motivation*: There have been several disjoint mentions of improving the 
 optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
 a place to discuss what API changes are necessary for the long term, and to 
 provide links to other relevant JIRAs.
 Eventually, I hope this leads to a design doc outlining:
 * current issues
 * requirements such as supporting many types of objective functions, 
 optimization algorithms, and parameters to those algorithms
 * ideal API
 * breakdown of smaller JIRAs needed to achieve that API
 I will soon create an initial design doc, and I will try to watch this JIRA 
 and include ideas from JIRA comments.






[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs

2015-01-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277988#comment-14277988
 ] 

Alexander Ulanov commented on SPARK-5256:
-

Also, asynchronous gradient update might be a good thing to have.

 Improving MLlib optimization APIs
 -

 Key: SPARK-5256
 URL: https://issues.apache.org/jira/browse/SPARK-5256
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 *Goal*: Improve APIs for optimization
 *Motivation*: There have been several disjoint mentions of improving the 
 optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
 a place to discuss what API changes are necessary for the long term, and to 
 provide links to other relevant JIRAs.
 Eventually, I hope this leads to a design doc outlining:
 * current issues
 * requirements such as supporting many types of objective functions, 
 optimization algorithms, and parameters to those algorithms
 * ideal API
 * breakdown of smaller JIRAs needed to achieve that API
 I will soon create an initial design doc, and I will try to watch this JIRA 
 and include ideas from JIRA comments.






[jira] [Created] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches

2015-01-21 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-5362:
---

 Summary: Gradient and Optimizer to support generic output (instead 
of label) and data batches
 Key: SPARK-5362
 URL: https://issues.apache.org/jira/browse/SPARK-5362
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Alexander Ulanov
 Fix For: 1.3.0


Currently, the Gradient and Optimizer interfaces support data in the form of 
RDD[(Double, Vector)], which refers to label and features. This limits their 
application to classification problems. For example, an artificial neural 
network demands a Vector as output (instead of label: Double). Moreover, the 
current interface does not support data batches. I propose to replace label: 
Double with output: Vector. This enables passing a generic output instead of a 
label and also passing data and output batches stored in the corresponding vectors.
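
A hedged sketch of how existing training data could be adapted to the proposed 
form; the one-hot encoding shown is one possible convention, not part of the 
proposal itself:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Illustrative only: turn RDD[LabeledPoint] into (features, output) pairs
// by one-hot encoding the class label into the output vector.
def toVectorOutput(data: RDD[LabeledPoint], numClasses: Int): RDD[(Vector, Vector)] =
  data.map { lp =>
    val out = Array.fill(numClasses)(0.0)
    out(lp.label.toInt) = 1.0
    (lp.features, Vectors.dense(out))
  }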






[jira] [Commented] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches

2015-01-21 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286703#comment-14286703
 ] 

Alexander Ulanov commented on SPARK-5362:
-

https://github.com/apache/spark/pull/4152

 Gradient and Optimizer to support generic output (instead of label) and data 
 batches
 

 Key: SPARK-5362
 URL: https://issues.apache.org/jira/browse/SPARK-5362
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Alexander Ulanov
 Fix For: 1.3.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 Currently, the Gradient and Optimizer interfaces support data in the form of 
 RDD[(Double, Vector)], which refers to a label and features. This limits their 
 application to classification problems. For example, an artificial neural 
 network demands a Vector as output (instead of label: Double). Moreover, the 
 current interface does not support data batches. I propose to replace label: 
 Double with output: Vector. This enables passing a generic output instead of a 
 label, and also passing data and output batches stored in the corresponding 
 vectors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs

2015-01-21 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286706#comment-14286706
 ] 

Alexander Ulanov commented on SPARK-5256:
-

I've implemented my proposition with Vector as output in 
https://issues.apache.org/jira/browse/SPARK-5362

 Improving MLlib optimization APIs
 -

 Key: SPARK-5256
 URL: https://issues.apache.org/jira/browse/SPARK-5256
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 *Goal*: Improve APIs for optimization
 *Motivation*: There have been several disjoint mentions of improving the 
 optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
 a place to discuss what API changes are necessary for the long term, and to 
 provide links to other relevant JIRAs.
 Eventually, I hope this leads to a design doc outlining:
 * current issues
 * requirements such as supporting many types of objective functions, 
 optimization algorithms, and parameters to those algorithms
 * ideal API
 * breakdown of smaller JIRAs needed to achieve that API
 I will soon create an initial design doc, and I will try to watch this JIRA 
 and include ideas from JIRA comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5912) Programming guide for feature selection

2015-02-19 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328246#comment-14328246
 ] 

Alexander Ulanov commented on SPARK-5912:
-

Sure, I can. Could you point me to some template or a good example of a 
programming guide?

 Programming guide for feature selection
 ---

 Key: SPARK-5912
 URL: https://issues.apache.org/jira/browse/SPARK-5912
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 The new ChiSqSelector for feature selection should have a section in the 
 Programming Guide.  It should probably be under the feature extraction and 
 transformation section as a new subsection for feature selection.
 If we get more feature selection methods later on, we could expand it to a 
 larger section of the guide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7316) Add step capability to RDD sliding window

2015-05-04 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-7316:

Description: 
RDDFunctions in MLlib contains a sliding window implementation with step 1. The 
user should be able to define the step. This capability should be implemented.

Although one can generate sliding windows with step 1 and then filter every Nth 
window, that may take much more time and disk space, depending on the step size. 
For example, if your window size is 1000, you will generate a thousand times 
more data than your initial dataset. That makes no sense if you only need every 
Nth window: generating windows with step N directly produces only 1000/N times 
the initial dataset, i.e. N times less data. 


  was:RDDFunctions in MLlib contains a sliding window implementation with step 1. 
The user should be able to define the step. This capability should be implemented.


 Add step capability to RDD sliding window
 -

 Key: SPARK-7316
 URL: https://issues.apache.org/jira/browse/SPARK-7316
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 RDDFunctions in MLlib contains a sliding window implementation with step 1. 
 The user should be able to define the step. This capability should be implemented.
 Although one can generate sliding windows with step 1 and then filter every 
 Nth window, that may take much more time and disk space, depending on the step 
 size. For example, if your window size is 1000, you will generate a thousand 
 times more data than your initial dataset. That makes no sense if you only 
 need every Nth window: generating windows with step N directly produces only 
 1000/N times the initial dataset, i.e. N times less data. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7316) Add step capability to RDD sliding window

2015-05-06 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531229#comment-14531229
 ] 

Alexander Ulanov commented on SPARK-7316:
-

I would say that the major use case is practical considerations :)

In my case it is time series analysis of sensor data. It does not make sense to 
analyze time windows with step 1, because it is a high-frequency sensor (1024 Hz). 
Also, even if we wanted to, the size of the resulting data becomes enormous. For 
example, I have 2B data points (542 hours), 23GB of binary data. If I apply a 
sliding window of size 1024 with step 1, it will result in 1024*23GB = 23.5TB of 
data, which I am currently unable to process with Spark (honestly speaking, my 
disk space is only 10TB). If the data is stored in HDFS, replication triples it, 
i.e. 70TB. 
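
For illustration, a sketch of the current workaround versus the proposed step 
parameter (sc is an existing SparkContext; the two-argument sliding is the 
proposed API, not the current one):

```
import org.apache.spark.mllib.rdd.RDDFunctions._

val data = sc.parallelize(1 to 1000000)

// Workaround today: generate every window with step 1, then keep every Nth.
// All intermediate windows are still materialized.
val everyNth = data.sliding(1024)
  .zipWithIndex
  .filter { case (_, i) => i % 32 == 0 }
  .map { case (w, _) => w }

// Proposed: generate only the needed windows directly.
// val stepped = data.sliding(1024, 32)  // hypothetical two-argument overload
```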


 Add step capability to RDD sliding window
 -

 Key: SPARK-7316
 URL: https://issues.apache.org/jira/browse/SPARK-7316
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 RDDFunctions in MLlib contains a sliding window implementation with step 1. 
 The user should be able to define the step. This capability should be implemented.
 Although one can generate sliding windows with step 1 and then filter every 
 Nth window, that may take much more time and disk space, depending on the step 
 size. For example, if your window size is 1000, you will generate a thousand 
 times more data than your initial dataset. That makes no sense if you only 
 need every Nth window: generating windows with step N directly produces only 
 1000/N times the initial dataset, i.e. N times less data. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-05-11 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538209#comment-14538209
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Current implementation: 
https://github.com/avulanov/spark/tree/ann-interface-gemm

 Artificial neural networks for MLlib deep learning
 --

 Key: SPARK-5575
 URL: https://issues.apache.org/jira/browse/SPARK-5575
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Alexander Ulanov

 Goal: Implement various types of artificial neural networks
 Motivation: deep learning trend
 Requirements: 
 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
 and Backpropagation etc. should be implemented as traits or interfaces, so 
 they can be easily extended or reused
 2) Implement complex abstractions, such as feed forward and recurrent networks
 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
 autoencoder (sparse and denoising), stacked autoencoder, restricted 
 Boltzmann machines (RBM), deep belief networks (DBN), etc.
 4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
 poolers, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs

2015-04-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14494518#comment-14494518
 ] 

Alexander Ulanov commented on SPARK-5256:
-

Probably the main issue for MLlib is that iterative algorithms are implemented 
with the aggregate function. It has a fixed overhead of around half a second, 
which limits its applicability when one needs to run a large number of 
iterations. This is exactly the case for the bigger data that Spark is intended 
for. The problem gets worse with stochastic algorithms, because there is no good 
way to randomly pick data from an RDD; one has to scan through it sequentially.
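
For illustration, a sketch of the usual mini-batch workaround, per-iteration 
sampling in the style of MLlib's GradientDescent (data, numIterations and the 
gradient step itself are assumed context):

```
// Mini-batch via sample(); names and fractions are illustrative.
// sample() still evaluates the partitions, so the fixed per-iteration
// overhead remains: there is no O(1) random access into an RDD.
val miniBatchFraction = 0.01
for (iter <- 1 to numIterations) {
  val batch = data.sample(withReplacement = false, miniBatchFraction, 42 + iter)
  // compute the gradient on `batch` (e.g. via treeAggregate) and update weights
}
```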

 Improving MLlib optimization APIs
 -

 Key: SPARK-5256
 URL: https://issues.apache.org/jira/browse/SPARK-5256
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 *Goal*: Improve APIs for optimization
 *Motivation*: There have been several disjoint mentions of improving the 
 optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
 a place to discuss what API changes are necessary for the long term, and to 
 provide links to other relevant JIRAs.
 Eventually, I hope this leads to a design doc outlining:
 * current issues
 * requirements such as supporting many types of objective functions, 
 optimization algorithms, and parameters to those algorithms
 * ideal API
 * breakdown of smaller JIRAs needed to achieve that API
 I will soon create an initial design doc, and I will try to watch this JIRA 
 and include ideas from JIRA comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5256) Improving MLlib optimization APIs

2015-04-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14494568#comment-14494568
 ] 

Alexander Ulanov edited comment on SPARK-5256 at 4/14/15 6:43 PM:
--

Data large enough to require Spark suggests that the learning algorithm will be 
limited by time rather than by data. According to the paper The tradeoffs of 
large scale learning, SGD converges significantly faster than batch GD in this 
case. My use case is machine learning on large data, in particular, time series. 

Just in case, a link to the paper: 
http://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning.pdf


was (Author: avulanov):
Data large enough to require Spark suggests that the learning algorithm will be 
limited by time rather than by data. According to the paper The tradeoffs of 
large scale learning, SGD converges significantly faster than batch GD in this 
case. My use case is machine learning on large data, in particular, time series.

 Improving MLlib optimization APIs
 -

 Key: SPARK-5256
 URL: https://issues.apache.org/jira/browse/SPARK-5256
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 *Goal*: Improve APIs for optimization
 *Motivation*: There have been several disjoint mentions of improving the 
 optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
 a place to discuss what API changes are necessary for the long term, and to 
 provide links to other relevant JIRAs.
 Eventually, I hope this leads to a design doc outlining:
 * current issues
 * requirements such as supporting many types of objective functions, 
 optimization algorithms, and parameters to those algorithms
 * ideal API
 * breakdown of smaller JIRAs needed to achieve that API
 I will soon create an initial design doc, and I will try to watch this JIRA 
 and include ideas from JIRA comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs

2015-04-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14494568#comment-14494568
 ] 

Alexander Ulanov commented on SPARK-5256:
-

Data large enough to require Spark suggests that the learning algorithm will be 
limited by time rather than by data. According to the paper The tradeoffs of 
large scale learning, SGD converges significantly faster than batch GD in this 
case. My use case is machine learning on large data, in particular, time series.

 Improving MLlib optimization APIs
 -

 Key: SPARK-5256
 URL: https://issues.apache.org/jira/browse/SPARK-5256
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 *Goal*: Improve APIs for optimization
 *Motivation*: There have been several disjoint mentions of improving the 
 optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
 a place to discuss what API changes are necessary for the long term, and to 
 provide links to other relevant JIRAs.
 Eventually, I hope this leads to a design doc outlining:
 * current issues
 * requirements such as supporting many types of objective functions, 
 optimization algorithms, and parameters to those algorithms
 * ideal API
 * breakdown of smaller JIRAs needed to achieve that API
 I will soon create an initial design doc, and I will try to watch this JIRA 
 and include ideas from JIRA comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs

2015-04-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14494579#comment-14494579
 ] 

Alexander Ulanov commented on SPARK-5256:
-

[~shivaram] Indeed, performance is orthogonal to the API design. Though 
well-designed things should work efficiently, don't you think? :)

 Improving MLlib optimization APIs
 -

 Key: SPARK-5256
 URL: https://issues.apache.org/jira/browse/SPARK-5256
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 *Goal*: Improve APIs for optimization
 *Motivation*: There have been several disjoint mentions of improving the 
 optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
 a place to discuss what API changes are necessary for the long term, and to 
 provide links to other relevant JIRAs.
 Eventually, I hope this leads to a design doc outlining:
 * current issues
 * requirements such as supporting many types of objective functions, 
 optimization algorithms, and parameters to those algorithms
 * ideal API
 * breakdown of smaller JIRAs needed to achieve that API
 I will soon create an initial design doc, and I will try to watch this JIRA 
 and include ideas from JIRA comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5256) Improving MLlib optimization APIs

2015-04-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14494579#comment-14494579
 ] 

Alexander Ulanov edited comment on SPARK-5256 at 4/14/15 6:48 PM:
--

[~shivaram] Indeed, performance is orthogonal to the API design. Though 
well-designed things should work efficiently, shouldn't they? :)


was (Author: avulanov):
[~shivaram] Indeed, performance is orthogonal to the API design. Though 
well-designed things should work efficiently, don't you think? :)

 Improving MLlib optimization APIs
 -

 Key: SPARK-5256
 URL: https://issues.apache.org/jira/browse/SPARK-5256
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 *Goal*: Improve APIs for optimization
 *Motivation*: There have been several disjoint mentions of improving the 
 optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
 a place to discuss what API changes are necessary for the long term, and to 
 provide links to other relevant JIRAs.
 Eventually, I hope this leads to a design doc outlining:
 * current issues
 * requirements such as supporting many types of objective functions, 
 optimization algorithms, and parameters to those algorithms
 * ideal API
 * breakdown of smaller JIRAs needed to achieve that API
 I will soon create an initial design doc, and I will try to watch this JIRA 
 and include ideas from JIRA comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6673) spark-shell.cmd can't start even when spark was built in Windows

2015-04-03 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395001#comment-14395001
 ] 

Alexander Ulanov commented on SPARK-6673:
-

Probably a similar issue: I am trying to execute unit tests in MLlib with 
LocalClusterSparkContext on Windows 7. I am getting a bunch of errors in the log 
saying: Cannot find any assembly build directories. If I do set 
SPARK_SCALA_VERSION=2.10, then I get No assemblies found in 
'C:\dev\spark\mllib\.\assembly\target\scala-2.10'

 spark-shell.cmd can't start even when spark was built in Windows
 

 Key: SPARK-6673
 URL: https://issues.apache.org/jira/browse/SPARK-6673
 Project: Spark
  Issue Type: Bug
  Components: Windows
Affects Versions: 1.3.0
Reporter: Masayoshi TSUZUKI
Assignee: Masayoshi TSUZUKI
Priority: Blocker

 spark-shell.cmd can't start.
 {code}
 bin\spark-shell.cmd --master local
 {code}
 will get
 {code}
 Failed to find Spark assembly JAR.
 You need to build Spark before running this program.
 {code}
 even when we have built Spark.
 This is because of the lack of the environment variable {{SPARK_SCALA_VERSION}}, 
 which is used in {{spark-class2.cmd}}.
 In the Linux scripts, this value is set to {{2.10}} or {{2.11}} by default in 
 {{load-spark-env.sh}}, but there is no equivalent script for Windows.
 As a workaround, by executing
 {code}
 set SPARK_SCALA_VERSION=2.10
 {code}
 before executing spark-shell.cmd, we can successfully start it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2015-04-03 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395185#comment-14395185
 ] 

Alexander Ulanov commented on SPARK-2356:
-

The following worked for me:
Download http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and 
put it in DISK:\FOLDERS\bin\
Set HADOOP_CONF=DISK:\FOLDERS

 Exception: Could not locate executable null\bin\winutils.exe in the Hadoop 
 ---

 Key: SPARK-2356
 URL: https://issues.apache.org/jira/browse/SPARK-2356
 Project: Spark
  Issue Type: Bug
  Components: Windows
Affects Versions: 1.0.0
Reporter: Kostiantyn Kudriavtsev
Priority: Critical

 I'm trying to run some transformations on Spark; they work fine on the cluster 
 (YARN, Linux machines). However, when I try to run them on a local machine 
 (Windows 7) under a unit test, I get errors (I don't use Hadoop; I read files 
 from the local filesystem):
 {code}
 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
 hadoop binary path
 java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
 Hadoop binaries.
   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
   at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
   at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
   at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
   at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
   at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
   at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
   at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
   at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
 {code}
 This happens because the Hadoop config is initialized each time a Spark context 
 is created, regardless of whether Hadoop is required or not.
 I propose to add a special flag to indicate whether the Hadoop config is 
 required (or to start this configuration manually)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-07 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483729#comment-14483729
 ] 

Alexander Ulanov edited comment on SPARK-6682 at 4/7/15 6:35 PM:
-

This is a very good idea. Please note, though, that there are a few issues here.
1) Setting the optimizer: optimizers (LBFGS and SGD) take Gradient and Updater 
as constructor parameters. I don't think it is a good idea to force users to 
create a Gradient and an Updater separately just to be able to create an 
Optimizer. So one has to explicitly implement methods like setLBFGSOptimizer or 
setSGDOptimizer and return the optimizer, so that the user can set its parameters:

```
  def LBFGSOptimizer: LBFGS = {
    val lbfgs = new LBFGS(_gradient, _updater)
    optimizer = lbfgs
    lbfgs
  }
```

Another downside is that if someone implements a new Optimizer, one has to add 
a setMyOptimizer to the builder. The above problems might be solved by figuring 
out a better Optimizer interface that allows setting its parameters without 
actually creating it.

2) Setting parameters after setting the optimizer: what if the user sets the 
Updater after setting the Optimizer? The Optimizer takes the Updater as a 
constructor parameter! So one has to recreate the corresponding Optimizer:

```
  private[this] def updateGradient(gradient: Gradient): Unit = {
    optimizer match {
      case lbfgs: LBFGS => lbfgs.setGradient(gradient)
      case sgd: GradientDescent => sgd.setGradient(gradient)
      case other => throw new UnsupportedOperationException(
        s"Only LBFGS and GradientDescent are supported but got ${other.getClass}.")
    }
  }
```

So it is essential to work out the Optimizer interface first.


was (Author: avulanov):
This is a very good idea. Please note, though, that there are a few issues here.
1) Setting the optimizer: optimizers (LBFGS and SGD) take Gradient and Updater 
as constructor parameters. I don't think it is a good idea to force users to 
create a Gradient and an Updater separately just to be able to create an 
Optimizer. So one has to explicitly implement methods like setLBFGSOptimizer or 
setSGDOptimizer and return the optimizer, so that the user can set its parameters:

```
  def LBFGSOptimizer: LBFGS = {
    val lbfgs = new LBFGS(_gradient, _updater)
    optimizer = lbfgs
    lbfgs
  }
```

Another downside is that if someone implements a new Optimizer, one has to add 
a setMyOptimizer to the builder. The above problems might be solved by figuring 
out a better Optimizer interface that allows setting its parameters without 
actually creating it.

2) Setting parameters after setting the optimizer: what if the user sets the 
Updater after setting the Optimizer? The Optimizer takes the Updater as a 
constructor parameter! So one has to recreate the corresponding Optimizer:

```
  private[this] def updateGradient(gradient: Gradient): Unit = {
    optimizer match {
      case lbfgs: LBFGS => lbfgs.setGradient(gradient)
      case sgd: GradientDescent => sgd.setGradient(gradient)
      case other => throw new UnsupportedOperationException(
        s"Only LBFGS and GradientDescent are supported but got ${other.getClass}.")
    }
  }
```

So it is essential to work out the Optimizer interface first.

 Deprecate static train and use builder instead for Scala/Java
 -

 Key: SPARK-6682
 URL: https://issues.apache.org/jira/browse/SPARK-6682
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 In MLlib, we have for some time been unofficially moving away from the old 
 static train() methods and moving towards builder patterns.  This JIRA is to 
 discuss this move and (hopefully) make it official.
 Old static train() API:
 {code}
 val myModel = NaiveBayes.train(myData, ...)
 {code}
 New builder pattern API:
 {code}
 val nb = new NaiveBayes().setLambda(0.1)
 val myModel = nb.train(myData)
 {code}
 Pros of the builder pattern:
 * Much less code when algorithms have many parameters.  Since Java does not 
 support default arguments, we required *many* duplicated static train() 
 methods (for each prefix set of arguments).
 * Helps to enforce default parameters.  Users should ideally not have to even 
 think about setting parameters if they just want to try an algorithm quickly.
 * Matches spark.ml API
 Cons of the builder pattern:
 * In Python APIs, static train methods are more Pythonic.
 Proposal:
 * Scala/Java: We should start deprecating the old static train() methods.  We 
 must keep them for API stability, but deprecating will help with API 
 consistency, making it clear that everyone should use the builder pattern.  
 As we deprecate them, we should make sure that the builder pattern supports 
 all parameters.
 * Python: Keep 

[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-07 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483729#comment-14483729
 ] 

Alexander Ulanov commented on SPARK-6682:
-

This is a very good idea. Please note, though, that there are a few issues here.
1) Setting the optimizer: optimizers (LBFGS and SGD) take Gradient and Updater 
as constructor parameters. I don't think it is a good idea to force users to 
create a Gradient and an Updater separately just to be able to create an 
Optimizer. So one has to explicitly implement methods like setLBFGSOptimizer or 
setSGDOptimizer and return the optimizer, so that the user can set its parameters:

```
  def LBFGSOptimizer: LBFGS = {
    val lbfgs = new LBFGS(_gradient, _updater)
    optimizer = lbfgs
    lbfgs
  }
```

Another downside is that if someone implements a new Optimizer, one has to add 
a setMyOptimizer to the builder. The above problems might be solved by figuring 
out a better Optimizer interface that allows setting its parameters without 
actually creating it.

2) Setting parameters after setting the optimizer: what if the user sets the 
Updater after setting the Optimizer? The Optimizer takes the Updater as a 
constructor parameter! So one has to recreate the corresponding Optimizer:

```
  private[this] def updateGradient(gradient: Gradient): Unit = {
    optimizer match {
      case lbfgs: LBFGS => lbfgs.setGradient(gradient)
      case sgd: GradientDescent => sgd.setGradient(gradient)
      case other => throw new UnsupportedOperationException(
        s"Only LBFGS and GradientDescent are supported but got ${other.getClass}.")
    }
  }
```

So it is essential to work out the Optimizer interface first.
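
One possible direction (purely a sketch; this trait does not exist in MLlib) is 
to lift the parameter setters into a common trait, so that builder code can 
configure any optimizer without pattern matching:

```
import org.apache.spark.mllib.optimization.{Gradient, Optimizer, Updater}

// Hypothetical: uniform setters that LBFGS and GradientDescent would mix in.
trait ConfigurableOptimizer extends Optimizer {
  def setGradient(gradient: Gradient): this.type
  def setUpdater(updater: Updater): this.type
}
```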

 Deprecate static train and use builder instead for Scala/Java
 -

 Key: SPARK-6682
 URL: https://issues.apache.org/jira/browse/SPARK-6682
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 In MLlib, we have for some time been unofficially moving away from the old 
 static train() methods and moving towards builder patterns.  This JIRA is to 
 discuss this move and (hopefully) make it official.
 Old static train() API:
 {code}
 val myModel = NaiveBayes.train(myData, ...)
 {code}
 New builder pattern API:
 {code}
 val nb = new NaiveBayes().setLambda(0.1)
 val myModel = nb.train(myData)
 {code}
 Pros of the builder pattern:
 * Much less code when algorithms have many parameters.  Since Java does not 
 support default arguments, we required *many* duplicated static train() 
 methods (for each prefix set of arguments).
 * Helps to enforce default parameters.  Users should ideally not have to even 
 think about setting parameters if they just want to try an algorithm quickly.
 * Matches spark.ml API
 Cons of the builder pattern:
 * In Python APIs, static train methods are more Pythonic.
 Proposal:
 * Scala/Java: We should start deprecating the old static train() methods.  We 
 must keep them for API stability, but deprecating will help with API 
 consistency, making it clear that everyone should use the builder pattern.  
 As we deprecate them, we should make sure that the builder pattern supports 
 all parameters.
 * Python: Keep static train methods.
 CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-08 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485554#comment-14485554
 ] 

Alexander Ulanov commented on SPARK-6682:
-

[~yuu.ishik...@gmail.com] 
They reside in the package org.apache.spark.mllib.optimization: class LBFGS(private 
var gradient: Gradient, private var updater: Updater) and class GradientDescent 
private[mllib] (private var gradient: Gradient, private var updater: Updater). 
They extend the Optimizer trait, which has only one function: def optimize(data: 
RDD[(Double, Vector)], initialWeights: Vector): Vector. This function is limited 
to only one type of input: vectors and their labels. I have submitted a separate 
issue regarding this: https://issues.apache.org/jira/browse/SPARK-5362. 

1. Right now the static methods work with hard-coded optimizers, such as 
LogisticRegressionWithSGD. This is not very convenient. I think that moving away 
from static methods to builders implies that optimizers can also be set by 
users. That will be a problem, because the current optimizers require an Updater 
and a Gradient at creation time. 
2. The workaround I suggested in the previous post addresses this.


 Deprecate static train and use builder instead for Scala/Java
 -

 Key: SPARK-6682
 URL: https://issues.apache.org/jira/browse/SPARK-6682
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 In MLlib, we have for some time been unofficially moving away from the old 
 static train() methods and moving towards builder patterns.  This JIRA is to 
 discuss this move and (hopefully) make it official.
 Old static train() API:
 {code}
 val myModel = NaiveBayes.train(myData, ...)
 {code}
 New builder pattern API:
 {code}
 val nb = new NaiveBayes().setLambda(0.1)
 val myModel = nb.train(myData)
 {code}
 Pros of the builder pattern:
 * Much less code when algorithms have many parameters.  Since Java does not 
 support default arguments, we required *many* duplicated static train() 
 methods (for each prefix set of arguments).
 * Helps to enforce default parameters.  Users should ideally not have to even 
 think about setting parameters if they just want to try an algorithm quickly.
 * Matches spark.ml API
 Cons of the builder pattern:
 * In Python APIs, static train methods are more Pythonic.
 Proposal:
 * Scala/Java: We should start deprecating the old static train() methods.  We 
 must keep them for API stability, but deprecating will help with API 
 consistency, making it clear that everyone should use the builder pattern.  
 As we deprecate them, we should make sure that the builder pattern supports 
 all parameters.
 * Python: Keep static train methods.
 CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8449) HDF5 read/write support for Spark MLlib

2015-06-18 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-8449:
---

 Summary: HDF5 read/write support for Spark MLlib
 Key: SPARK-8449
 URL: https://issues.apache.org/jira/browse/SPARK-8449
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.1


Add support for reading and writing the HDF5 file format to/from LabeledPoint. 
HDFS and the local file system have to be supported. Other Spark formats are to 
be discussed. 

Interface proposal:
/* path - directory path in any Hadoop-supported file system URI */
MLUtils.saveAsHDF5(sc: SparkContext, path: String, data: RDD[LabeledPoint]): Unit
/* path - file or directory path in any Hadoop-supported file system URI */
MLUtils.loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint]
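
A hypothetical usage sketch of the proposed methods (illustrative only: neither 
method exists in MLUtils today, and sc is an existing SparkContext):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

val points = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.5, 1.5)),
  LabeledPoint(0.0, Vectors.dense(2.0, 0.1))))

// Proposed (hypothetical) API: write to and read back from HDF5.
MLUtils.saveAsHDF5(sc, "hdfs:///data/points.h5", points)
val loaded = MLUtils.loadHDF5(sc, "hdfs:///data/points.h5")
{code}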




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8449) HDF5 read/write support for Spark MLlib

2015-06-18 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592398#comment-14592398
 ] 

Alexander Ulanov commented on SPARK-8449:
-

It seems that using the official HDF5 reader is not a viable choice for Spark 
due to its platform-dependent binaries. We need to look for a pure Java 
implementation. Apparently, there is one called netCDF: 
http://www.unidata.ucar.edu/blogs/news/entry/netcdf_java_library_version_44. It 
might be tricky to use because the license is not Apache. However, it is worth 
a look.

 HDF5 read/write support for Spark MLlib
 ---

 Key: SPARK-8449
 URL: https://issues.apache.org/jira/browse/SPARK-8449
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.1

   Original Estimate: 96h
  Remaining Estimate: 96h

 Add support for reading and writing the HDF5 file format to/from LabeledPoint. 
 HDFS and the local file system have to be supported. Other Spark formats are to 
 be discussed. 
 Interface proposal:
 /* path - directory path in any Hadoop-supported file system URI */
 MLUtils.saveAsHDF5(sc: SparkContext, path: String, data: RDD[LabeledPoint]): Unit
 /* path - file or directory path in any Hadoop-supported file system URI */
 MLUtils.loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8449) HDF5 read/write support for Spark MLlib

2015-06-18 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592398#comment-14592398
 ] 

Alexander Ulanov edited comment on SPARK-8449 at 6/18/15 7:53 PM:
--

It seems that using the official HDF5 reader is not a viable choice for Spark 
due to its platform-dependent binaries. We need to look for a pure Java 
implementation. Apparently, there is one called netCDF: 
http://www.unidata.ucar.edu/blogs/news/entry/netcdf_java_library_version_44. It 
might be tricky to use because the license is not Apache. However, it is worth 
a look.


was (Author: avulanov):
It seems that using the official HDF5 reader is not a viable choice for Spark 
due to its platform-dependent binaries. We need to look for a pure Java 
implementation. Apparently, there is one called netCDF: 
http://www.unidata.ucar.edu/blogs/news/entry/netcdf_java_library_version_44. It 
might be tricky to use because the license is not Apache. However, it is worth 
a look.

 HDF5 read/write support for Spark MLlib
 ---

 Key: SPARK-8449
 URL: https://issues.apache.org/jira/browse/SPARK-8449
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.1

   Original Estimate: 96h
  Remaining Estimate: 96h

 Add support for reading and writing the HDF5 file format to/from LabeledPoint. 
 HDFS and the local file system have to be supported. Other Spark formats are to 
 be discussed. 
 Interface proposal:
 /* path - directory path in any Hadoop-supported file system URI */
 MLUtils.saveAsHDF5(sc: SparkContext, path: String, data: RDD[LabeledPoint]): Unit
 /* path - file or directory path in any Hadoop-supported file system URI */
 MLUtils.loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-06-11 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14582768#comment-14582768
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Hi Janani, 

There is already an implementation of DBN (and RBM) by [~gq]. You can find it 
here: https://github.com/witgo/spark/tree/ann-interface-gemm-dbn

 Artificial neural networks for MLlib deep learning
 --

 Key: SPARK-5575
 URL: https://issues.apache.org/jira/browse/SPARK-5575
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Alexander Ulanov

 Goal: Implement various types of artificial neural networks
 Motivation: deep learning trend
 Requirements: 
 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
 and Backpropagation etc. should be implemented as traits or interfaces, so 
 they can be easily extended or reused
 2) Implement complex abstractions, such as feed forward and recurrent networks
 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
 autoencoder (sparse and denoising), stacked autoencoder, restricted 
 Boltzmann machines (RBM), deep belief networks (DBN), etc.
 4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
 poolers, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9897) User Guide for Multilayer Perceptron Classifier

2015-08-12 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14694356#comment-14694356
 ] 

Alexander Ulanov commented on SPARK-9897:
-

We already have an issue for MLP classifier docs: 
https://issues.apache.org/jira/browse/SPARK-9846. I plan to resolve it soon. 
Could you close this one?

 User Guide for Multilayer Perceptron Classifier
 ---

 Key: SPARK-9897
 URL: https://issues.apache.org/jira/browse/SPARK-9897
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Feynman Liang

 SPARK-9471 adds MLPs to ML Pipelines, an algorithm not covered by the MLlib 
 docs. We should update the user guide to include this under the {{Algorithm 
 Guides > Algorithms in spark.ml}} section of {{ml-guide}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-9897) User Guide for Multilayer Perceptron Classifier

2015-08-12 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-9897:

Comment: was deleted

(was: We already have an issue for MLP classifier docs: 
https://issues.apache.org/jira/browse/SPARK-9846. I plan to resolve it soon. 
Could you close this one?)

 User Guide for Multilayer Perceptron Classifier
 ---

 Key: SPARK-9897
 URL: https://issues.apache.org/jira/browse/SPARK-9897
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Feynman Liang

 SPARK-9471 adds MLPs to ML Pipelines, an algorithm not covered by the MLlib 
 docs. We should update the user guide to include this under the {{Algorithm 
 Guides > Algorithms in spark.ml}} section of {{ml-guide}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9897) User Guide for Multilayer Perceptron Classifier

2015-08-12 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14694355#comment-14694355
 ] 

Alexander Ulanov commented on SPARK-9897:
-

We already have an issue for MLP classifier docs: 
https://issues.apache.org/jira/browse/SPARK-9846. I plan to resolve it soon. 
Could you close this one?

 User Guide for Multilayer Perceptron Classifier
 ---

 Key: SPARK-9897
 URL: https://issues.apache.org/jira/browse/SPARK-9897
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Feynman Liang

 SPARK-9471 adds MLPs to ML Pipelines, an algorithm not covered by the MLlib 
 docs. We should update the user guide to include this under the {{Algorithm 
 Guides > Algorithms in spark.ml}} section of {{ml-guide}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9951) Example code for Multilayer Perceptron Classifier

2015-08-17 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700567#comment-14700567
 ] 

Alexander Ulanov commented on SPARK-9951:
-

I've submitted a PR for the user guide. Could you suggest whether the example 
code in the PR can be used for this issue? https://github.com/apache/spark/pull/8262

 Example code for Multilayer Perceptron Classifier
 -

 Key: SPARK-9951
 URL: https://issues.apache.org/jira/browse/SPARK-9951
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley

 Add an example to the examples/ code folder for Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9380) Pregel example fix in graphx-programming-guide

2015-07-29 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov resolved SPARK-9380.
-
Resolution: Fixed

 Pregel example fix in graphx-programming-guide
 --

 Key: SPARK-9380
 URL: https://issues.apache.org/jira/browse/SPARK-9380
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


 The Pregel operator example expressing single-source 
 shortest paths does not work due to an incorrect graph type: Graph[Int, 
 Double] should be Graph[Long, Double]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-9380) Pregel example fix in graphx-programming-guide

2015-07-29 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-9380:

Comment: was deleted

(was: It seems that I did not name the PR correctly. I renamed it and resolved 
this issue. Sorry for the inconvenience.
)

 Pregel example fix in graphx-programming-guide
 --

 Key: SPARK-9380
 URL: https://issues.apache.org/jira/browse/SPARK-9380
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


 The Pregel operator example expressing single-source 
 shortest paths does not work due to an incorrect graph type: Graph[Int, 
 Double] should be Graph[Long, Double]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9380) Pregel example fix in graphx-programming-guide

2015-07-29 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646854#comment-14646854
 ] 

Alexander Ulanov commented on SPARK-9380:
-

It seems that I did not name the PR correctly. I renamed it and resolved this 
issue. Sorry for the inconvenience.


 Pregel example fix in graphx-programming-guide
 --

 Key: SPARK-9380
 URL: https://issues.apache.org/jira/browse/SPARK-9380
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


 The Pregel operator example expressing single-source 
 shortest paths does not work due to an incorrect graph type: Graph[Int, 
 Double] should be Graph[Long, Double]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9380) Pregel example fix in graphx-programming-guide

2015-07-29 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646853#comment-14646853
 ] 

Alexander Ulanov commented on SPARK-9380:
-

It seems that I did not name the PR correctly. I renamed it and resolved this 
issue. Sorry for the inconvenience.


 Pregel example fix in graphx-programming-guide
 --

 Key: SPARK-9380
 URL: https://issues.apache.org/jira/browse/SPARK-9380
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


 The Pregel operator example expressing single-source 
 shortest paths does not work due to an incorrect graph type: Graph[Int, 
 Double] should be Graph[Long, Double]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9508) Align graphx programming guide with the updated Pregel code

2015-07-31 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-9508:
---

 Summary: Align graphx programming guide with the updated Pregel 
code
 Key: SPARK-9508
 URL: https://issues.apache.org/jira/browse/SPARK-9508
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
Priority: Minor
 Fix For: 1.4.0


SPARK-9436 simplifies the Pregel code. The graphx-programming-guide needs to be 
updated accordingly, since it lists the old Pregel code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9436) Simplify Pregel by merging joins

2015-07-29 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-9436:

Summary: Simplify Pregel by merging joins  (was: Merge joins in Pregel )

 Simplify Pregel by merging joins
 

 Key: SPARK-9436
 URL: https://issues.apache.org/jira/browse/SPARK-9436
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
Priority: Minor
 Fix For: 1.4.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 Pregel code contains two consecutive joins: 
 ```
 g.vertices.innerJoin(messages)(vprog)
 ...
 g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => 
 newOpt.getOrElse(old) }
 ```
 They can be replaced by one join. Ankur Dave proposed a patch based on our 
 discussion in mailing list: 
 https://www.mail-archive.com/dev@spark.apache.org/msg10316.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9436) Merge joins in Pregel

2015-07-29 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-9436:
---

 Summary: Merge joins in Pregel 
 Key: SPARK-9436
 URL: https://issues.apache.org/jira/browse/SPARK-9436
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
Priority: Minor
 Fix For: 1.4.0


Pregel code contains two consecutive joins: 
```
g.vertices.innerJoin(messages)(vprog)
...
g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }
```
They can be replaced by one join. Ankur Dave proposed a patch based on our 
discussion in mailing list: 
https://www.mail-archive.com/dev@spark.apache.org/msg10316.html
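
For reference, a sketch of the merged version, assuming the usual Pregel 
internals (messages: VertexRDD[A], vprog: (VertexId, VD, A) => VD): GraphX's 
joinVertices applies vprog where a message exists and keeps the old attribute 
otherwise, which is exactly the newOpt.getOrElse(old) semantics:

```
// Single join replacing innerJoin + outerJoinVertices (sketch):
g = g.joinVertices(messages)(vprog)
```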



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9471) Multilayer perceptron

2015-07-30 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-9471:
---

 Summary: Multilayer perceptron 
 Key: SPARK-9471
 URL: https://issues.apache.org/jira/browse/SPARK-9471
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


Implement Multilayer Perceptron for Spark ML. Requirements:
1) ML pipelines interface
2) Extensible internal interface for further development of artificial neural 
networks for ML
3) Efficient and scalable: use vectors and BLAS
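
For illustration, a sketch of the intended ML pipelines usage (parameter values 
are illustrative; train is assumed to be a DataFrame with label and features 
columns):

{code}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// Layer sizes: input, two hidden layers, output (illustrative values).
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 4, 3))
  .setBlockSize(128) // group data into matrices for BLAS-level efficiency
  .setMaxIter(100)
val model = mlp.fit(train)
{code}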



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9951) Example code for Multilayer Perceptron Classifier

2015-08-14 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697902#comment-14697902
 ] 

Alexander Ulanov commented on SPARK-9951:
-

I have this already; I plan to use it for the User Guide. Should we have 
different example code in the examples/ folder?

 Example code for Multilayer Perceptron Classifier
 -

 Key: SPARK-9951
 URL: https://issues.apache.org/jira/browse/SPARK-9951
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley

 Add an example to the examples/ code folder for Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9380) Pregel example fix in graphx-programming-guide

2015-07-27 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-9380:
---

 Summary: Pregel example fix in graphx-programming-guide
 Key: SPARK-9380
 URL: https://issues.apache.org/jira/browse/SPARK-9380
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


The Pregel operator example expressing single-source 
shortest paths does not work due to an incorrect graph type: Graph[Int, 
Double] should be Graph[Long, Double]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2015-07-27 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14642513#comment-14642513
 ] 

Alexander Ulanov commented on SPARK-9273:
-

I had not heard about the PR until it was submitted. It would be useful to 
look at the code, benchmark it, and see whether it fits our API.

 Add Convolutional Neural network to Spark MLlib
 ---

 Key: SPARK-9273
 URL: https://issues.apache.org/jira/browse/SPARK-9273
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: yuhao yang

 Add Convolutional Neural network to Spark MLlib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2015-07-27 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14642513#comment-14642513
 ] 

Alexander Ulanov edited comment on SPARK-9273 at 7/27/15 9:54 AM:
--

I had not heard about the PR until it was submitted. It would be useful to 
look at the code, benchmark it, and see whether it fits our API. I've added the 
link to the umbrella issue for deep learning: 
https://issues.apache.org/jira/browse/SPARK-5575


was (Author: avulanov):
I had not heard about the PR until it was submitted. It would be useful to 
look at the code, benchmark it, and see whether it fits our API.

 Add Convolutional Neural network to Spark MLlib
 ---

 Key: SPARK-9273
 URL: https://issues.apache.org/jira/browse/SPARK-9273
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: yuhao yang

 Add Convolutional Neural network to Spark MLlib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Created] (SPARK-9118) Implement integer array parameters for ml.param as IntArrayParam

2015-07-16 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-9118:
---

 Summary: Implement integer array parameters for ml.param as 
IntArrayParam
 Key: SPARK-9118
 URL: https://issues.apache.org/jira/browse/SPARK-9118
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
Priority: Minor
 Fix For: 1.4.0


ml/param/params.scala lacks an integer array parameter, which is needed for models such as the multilayer perceptron to specify layer sizes. I suggest implementing it as IntArrayParam, similarly to DoubleArrayParam.
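A minimal sketch of such a parameter, modeled on DoubleArrayParam (the constructor shape is assumed to match):

{code:scala}
import org.apache.spark.ml.param.{Param, ParamPair, Params, ParamValidators}

// Hypothetical IntArrayParam mirroring the existing DoubleArrayParam.
class IntArrayParam(parent: Params, name: String, doc: String, isValid: Array[Int] => Boolean)
  extends Param[Array[Int]](parent, name, doc, isValid) {

  def this(parent: Params, name: String, doc: String) =
    this(parent, name, doc, ParamValidators.alwaysTrue)

  /** Creates a param pair with the given value (for Java). */
  override def w(value: Array[Int]): ParamPair[Array[Int]] = super.w(value)
}
{code}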



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (SPARK-9120) Add multivariate regression (or prediction) interface

2015-07-16 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630654#comment-14630654
 ] 

Alexander Ulanov commented on SPARK-9120:
-

Thank you, it sounds doable.

 Add multivariate regression (or prediction) interface
 -

 Key: SPARK-9120
 URL: https://issues.apache.org/jira/browse/SPARK-9120
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable, with a method predict:Double, by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify the RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification: it has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (SPARK-9120) Add multivariate regression (or prediction) interface

2015-07-16 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630596#comment-14630596
 ] 

Alexander Ulanov commented on SPARK-9120:
-

I think it should work for train (aka fit), which has to return the model; I am not sure about the model itself. The common ancestor, Model, does not contain anything that can be called for prediction; its direct descendant, PredictionModel, has predict:Double. Is there another way that you were mentioning?

 Add multivariate regression (or prediction) interface
 -

 Key: SPARK-9120
 URL: https://issues.apache.org/jira/browse/SPARK-9120
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable, with a method predict:Double, by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify the RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification: it has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models

2015-07-16 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630453#comment-14630453
 ] 

Alexander Ulanov commented on SPARK-3702:
-

[~josephkb] Hi, Joseph! Do you plan to add support for multivariate regression? I need this for the multi-layer perceptron, and a multivariate regression interface might be useful for other tasks as well. I've added an issue: https://issues.apache.org/jira/browse/SPARK-9120. I also wonder whether you plan to add integer array parameters: https://issues.apache.org/jira/browse/SPARK-9118. Both seem relatively easy to implement; the question is whether you plan to merge these features in the near future.

 Standardize MLlib classes for learners, models
 --

 Key: SPARK-3702
 URL: https://issues.apache.org/jira/browse/SPARK-3702
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical

 Summary: Create a class hierarchy for learning algorithms and the models 
 those algorithms produce.
 This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
 of subtasks).  See the requires links below for subtasks.
 Goals:
 * give intuitive structure to API, both for developers and for generated 
 documentation
 * support meta-algorithms (e.g., boosting)
 * support generic functionality (e.g., evaluation)
 * reduce code duplication across classes
 [Design doc for class hierarchy | 
 https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (SPARK-9120) Add multivariate regression (or prediction) interface

2015-07-16 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630560#comment-14630560
 ] 

Alexander Ulanov commented on SPARK-9120:
-

Thank you for sharing your thoughts. Do you mean that an algorithm that does multivariate regression should not be implemented within ML, since ML does not support multivariate prediction, and that the algorithm should live within MLlib for a while until you figure out a generic interface? By support I mean handling the .fit and .transform stuff, etc.

 Add multivariate regression (or prediction) interface
 -

 Key: SPARK-9120
 URL: https://issues.apache.org/jira/browse/SPARK-9120
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable, with a method predict:Double, by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify the RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification: it has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Created] (SPARK-9120) Add multivariate regression (or prediction) interface

2015-07-16 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-9120:
---

 Summary: Add multivariate regression (or prediction) interface
 Key: SPARK-9120
 URL: https://issues.apache.org/jira/browse/SPARK-9120
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable, with a method predict:Double, by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify the RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification: it has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector.
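To make the proposal concrete, the suggested method could have the following shape (illustrative names only, not a committed API):

{code:scala}
import org.apache.spark.mllib.linalg.Vector

// Illustrative only: the proposed method, sketched as a standalone trait.
trait MultivariateRegressionModel {
  /** Predict a vector of outputs for a single feature vector. */
  def predictMultivariate(features: Vector): Vector
}
{code}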




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Updated] (SPARK-9120) Add multivariate regression (or prediction) interface

2015-07-21 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-9120:

Description: 
org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable, with a method predict:Double, by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify the RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification: it has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector.

Update: After reading the design docs, adding predictMultivariate to RegressionModel does not seem reasonable to me anymore. The issue is as follows. RegressionModel extends PredictionModel, which has predict:Double. Its train method uses predict:Double for prediction, i.e. PredictionModel is hard-coded to have only one output. It is the same problem that I pointed out a long time ago in MLlib (https://issues.apache.org/jira/browse/SPARK-5362).


  was:
org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable, with a method predict:Double, by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify the RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification: it has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector.



 Add multivariate regression (or prediction) interface
 -

 Key: SPARK-9120
 URL: https://issues.apache.org/jira/browse/SPARK-9120
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable, with a method predict:Double, by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify the RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification: it has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector.
 Update: After reading the design docs, adding predictMultivariate to RegressionModel does not seem reasonable to me anymore. The issue is as follows. RegressionModel extends PredictionModel, which has predict:Double. Its train method uses predict:Double for prediction, i.e. PredictionModel is hard-coded to have only one output. It is the same problem that I pointed out a long time ago in MLlib (https://issues.apache.org/jira/browse/SPARK-5362).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Updated] (SPARK-9120) Add multivariate regression (or prediction) interface

2015-07-21 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-9120:

Description: 
org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable, with a method predict:Double, by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify the RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification: it has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector.

Update: After reading the design docs, adding predictMultivariate to RegressionModel does not seem reasonable to me anymore. The issue is as follows. RegressionModel extends PredictionModel, which has predict:Double. Its train method uses predict:Double for prediction, i.e. PredictionModel (and RegressionModel) is hard-coded to have only one output. A similar problem exists in MLlib (https://issues.apache.org/jira/browse/SPARK-5362).

The possible solution might require redesigning the class hierarchy or adding a separate interface that extends Model, though the latter means code duplication.
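A rough sketch of the "separate interface" alternative (all names hypothetical):

{code:scala}
import org.apache.spark.ml.Model
import org.apache.spark.mllib.linalg.Vector

// Hypothetical: a model family whose prediction is a Vector rather than a Double.
abstract class MultivariatePredictionModel[M <: MultivariatePredictionModel[M]]
  extends Model[M] {

  /** Predict a vector of outputs for a single feature vector. */
  def predictMultivariate(features: Vector): Vector
}
{code}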


  was:
org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable, with a method predict:Double, by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify the RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification: it has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector.

Update: After reading the design docs, adding predictMultivariate to RegressionModel does not seem reasonable to me anymore. The issue is as follows. RegressionModel extends PredictionModel, which has predict:Double. Its train method uses predict:Double for prediction, i.e. PredictionModel is hard-coded to have only one output. It is the same problem that I pointed out a long time ago in MLlib (https://issues.apache.org/jira/browse/SPARK-5362).



 Add multivariate regression (or prediction) interface
 -

 Key: SPARK-9120
 URL: https://issues.apache.org/jira/browse/SPARK-9120
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable, with a method predict:Double, by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify the RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification: it has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector.
 Update: After reading the design docs, adding predictMultivariate to RegressionModel does not seem reasonable to me anymore. The issue is as follows. RegressionModel extends PredictionModel, which has predict:Double. Its train method uses predict:Double for prediction, i.e. PredictionModel (and RegressionModel) is hard-coded to have only one output. A similar problem exists in MLlib (https://issues.apache.org/jira/browse/SPARK-5362).
 The possible solution might require redesigning the class hierarchy or adding a separate interface that extends Model, though the latter means code duplication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Created] (SPARK-11262) Unit test for gradient, loss layers, memory management for multilayer perceptron

2015-10-22 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-11262:


 Summary: Unit test for gradient, loss layers, memory management 
for multilayer perceptron
 Key: SPARK-11262
 URL: https://issues.apache.org/jira/browse/SPARK-11262
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.5.1
Reporter: Alexander Ulanov
 Fix For: 1.5.1


The multi-layer perceptron requires more rigorous tests and refactoring of the layer interfaces to accommodate the development of new features.
1)Implement unit tests for the gradient and loss (see the sketch below)
2)Refactor the internal layer interface to extract a "loss function"
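A numerical gradient check is the standard way to implement (1); a minimal, self-contained sketch, where the toy quadratic objective at the end stands in for the layer under test:

{code:scala}
import breeze.linalg.{DenseVector => BDV}

// Compare the analytic gradient against central finite differences.
def checkGradient(
    lossAndGrad: BDV[Double] => (Double, BDV[Double]),
    weights: BDV[Double],
    eps: Double = 1e-6,
    tol: Double = 1e-4): Boolean = {
  val (_, analytic) = lossAndGrad(weights)
  (0 until weights.length).forall { i =>
    val plus = weights.copy;  plus(i) += eps
    val minus = weights.copy; minus(i) -= eps
    val numeric = (lossAndGrad(plus)._1 - lossAndGrad(minus)._1) / (2 * eps)
    math.abs(numeric - analytic(i)) <= tol * math.max(1.0, math.abs(numeric))
  }
}

// Placeholder objective: L(w) = ||w||^2 / 2, gradient w.
val ok = checkGradient(w => ((w dot w) / 2.0, w), BDV(1.0, -2.0, 3.0))
{code}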



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-11-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997300#comment-14997300
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Hi Narine,

Thank you for your observation. Such information is indeed useful to know. LBFGS in Spark does not print any information during execution, and the ANN uses Spark's LBFGS. You might want to add the needed output to the LBFGS code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L185
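For illustration, a self-contained breeze sketch of the iterator mechanism that MLlib's LBFGS wraps, printing the loss per iteration (toy objective; not Spark code):

{code:scala}
import breeze.linalg.{DenseVector => BDV}
import breeze.optimize.{DiffFunction, LBFGS}

// Toy quadratic objective: value ||x||^2 / 2, gradient x.
val f = new DiffFunction[BDV[Double]] {
  def calculate(x: BDV[Double]): (Double, BDV[Double]) = ((x dot x) / 2.0, x)
}
val lbfgs = new LBFGS[BDV[Double]](20, 10) // maxIter = 20, history size m = 10

// Each State exposes the iteration number and the current loss.
lbfgs.iterations(f, BDV.ones[Double](5)).foreach { state =>
  println(s"iteration ${state.iter}: loss = ${state.value}")
}
{code}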

Best regards, Alexander 


> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted Boltzmann machines (RBM), deep belief networks (DBN), etc.
> 4) Implement or reuse supporting constructs, such as classifiers, normalizers, poolers, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Updated] (SPARK-10408) Autoencoder

2015-11-11 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-10408:
-
Description: 
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf]
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here 
3)Denoising autoencoder 
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers


References: 
1, 2. 
http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
(2010). Stacked denoising autoencoders: Learning useful representations in a 
deep network with a local denoising criterion. Journal of Machine Learning 
Research, 11(3371–3408). 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf
4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." 
Advances in neural information processing systems 19 (2007): 153. 
http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf

  was:
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf]
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here 
3)Denoising autoencoder 
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers: 

References: 
1-3. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf


> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf]
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1, 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Updated] (SPARK-10408) Autoencoder

2015-11-13 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-10408:
-
Description: 
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf]
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here 
3)Denoising autoencoder 
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers


References: 
1. Vincent, Pascal, et al. "Extracting and composing robust features with 
denoising autoencoders." Proceedings of the 25th international conference on 
Machine learning. ACM, 2008. 
http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
 
2. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
(2010). Stacked denoising autoencoders: Learning useful representations in a 
deep network with a local denoising criterion. Journal of Machine Learning 
Research, 11(3371–3408). 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf
4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." 
Advances in neural information processing systems 19 (2007): 153. 
http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf

  was:
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf]
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here 
3)Denoising autoencoder 
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers


References: 
1, 2. 
http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
(2010). Stacked denoising autoencoders: Learning useful representations in a 
deep network with a local denoising criterion. Journal of Machine Learning 
Research, 11(3371–3408). 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf
4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." 
Advances in neural information processing systems 19 (2007): 153. 
http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf


> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf]
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-11-03 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988705#comment-14988705
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Hi Disha,

RNN is a major feature. I suggest starting with a smaller contribution. Spark has included an implementation of the multi-layer perceptron since version 1.5. New features are supposed to reuse its code and follow the internal API it introduced.
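For orientation, the rough shape of that internal API (a paraphrase with illustrative names, not the exact private[ann] code):

{code:scala}
import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}

// A Layer describes its weight layout; a LayerModel holds concrete weights and
// implements the forward and backward passes over batches (one column per example).
trait Layer {
  def weightSize: Int
  def model(weights: BDV[Double]): LayerModel
}

trait LayerModel {
  def eval(input: BDM[Double]): BDM[Double]                              // forward pass
  def prevDelta(nextDelta: BDM[Double], input: BDM[Double]): BDM[Double] // backpropagation
  def grad(delta: BDM[Double], input: BDM[Double]): BDV[Double]          // weight gradient
}
{code}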

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted Boltzmann machines (RBM), deep belief networks (DBN), etc.
> 4) Implement or reuse supporting constructs, such as classifiers, normalizers, poolers, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2015-11-05 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992447#comment-14992447
 ] 

Alexander Ulanov commented on SPARK-9273:
-

Hi Yuhao. Sounds good! Thanks for refactoring the code to support the ANN internal interface. I was also able to run your example: it shows increasing accuracy during training, but it is not very fast.

There is a good explanation of how to use matrix multiplication for convolution: http://cs231n.github.io/convolutional-networks/. Basically, one needs to roll all image patches (the regions that will be convolved) into vectors and stack them together in a matrix. The weights of the convolutional layer should also be rolled into vectors and stacked. Multiplying the two resulting matrices gives the convolution result, which can be unrolled into a 3D matrix, although that would not be necessary for this implementation. We can discuss it offline if you wish.
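For concreteness, a rough breeze sketch of this rolling (im2col) idea; the helper names are illustrative, not existing code:

{code:scala}
import breeze.linalg.{DenseMatrix => BDM}

// Unroll every k x k patch of the image into a column so that convolving with a
// bank of filters becomes a single BLAS-friendly matrix product.
def im2col(image: BDM[Double], k: Int): BDM[Double] = {
  val outRows = image.rows - k + 1
  val outCols = image.cols - k + 1
  val patches = BDM.zeros[Double](k * k, outRows * outCols)
  for (i <- 0 until outRows; j <- 0 until outCols) {
    patches(::, i * outCols + j) :=
      image(i until i + k, j until j + k).toDenseMatrix.flatten()
  }
  patches
}

// filters: one rolled-out k*k filter per row; each output row is one convolved map.
def convolve(image: BDM[Double], filters: BDM[Double], k: Int): BDM[Double] =
  filters * im2col(image, k)
{code}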

Besides the optimization, there are a few more things to be done: unit tests for the new layers, a gradient test, representing the pooling layer as a functional layer, and a performance comparison with other implementations of CNN. You can take a look at the tests I've added for the MLP (https://issues.apache.org/jira/browse/SPARK-11262) and the MLP benchmark at https://github.com/avulanov/ann-benchmark. A separate branch/repo for these developments might be a good idea, and I'll be happy to help you with this.

> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Comment Edited] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2015-11-05 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992447#comment-14992447
 ] 

Alexander Ulanov edited comment on SPARK-9273 at 11/5/15 8:50 PM:
--

Hi Yuhao. Sounds good! Thanks for refactoring the code to support the ANN internal interface. I was also able to run your example: it shows increasing accuracy during training, but it is not very fast. Does it work with LBFGS?

There is a good explanation of how to use matrix multiplication for convolution: http://cs231n.github.io/convolutional-networks/. Basically, one needs to roll all image patches (the regions that will be convolved) into vectors and stack them together in a matrix. The weights of the convolutional layer should also be rolled into vectors and stacked. Multiplying the two resulting matrices gives the convolution result, which can be unrolled into a 3D matrix, although that would not be necessary for this implementation. We can discuss it offline if you wish.

Besides the optimization, there are a few more things to be done: unit tests for the new layers, a gradient test, representing the pooling layer as a functional layer, and a performance comparison with other implementations of CNN. You can take a look at the tests I've added for the MLP (https://issues.apache.org/jira/browse/SPARK-11262) and the MLP benchmark at https://github.com/avulanov/ann-benchmark. A separate branch/repo for these developments might be a good idea, and I'll be happy to help you with this.


was (Author: avulanov):
Hi Yuhao. Sounds good! Thanks for refactoring the code to support the ANN internal interface. I was also able to run your example: it shows increasing accuracy during training, but it is not very fast.

There is a good explanation of how to use matrix multiplication for convolution: http://cs231n.github.io/convolutional-networks/. Basically, one needs to roll all image patches (the regions that will be convolved) into vectors and stack them together in a matrix. The weights of the convolutional layer should also be rolled into vectors and stacked. Multiplying the two resulting matrices gives the convolution result, which can be unrolled into a 3D matrix, although that would not be necessary for this implementation. We can discuss it offline if you wish.

Besides the optimization, there are a few more things to be done: unit tests for the new layers, a gradient test, representing the pooling layer as a functional layer, and a performance comparison with other implementations of CNN. You can take a look at the tests I've added for the MLP (https://issues.apache.org/jira/browse/SPARK-11262) and the MLP benchmark at https://github.com/avulanov/ann-benchmark. A separate branch/repo for these developments might be a good idea, and I'll be happy to help you with this.

> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726423#comment-14726423
 ] 

Alexander Ulanov commented on SPARK-10408:
--

Added an implementation for (1), the basic deep autoencoder: https://github.com/avulanov/spark/tree/autoencoder-mlp

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf]
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here
> 3)Denoising autoencoder
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Created] (SPARK-10409) Multilayer perceptron regression

2015-09-01 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-10409:


 Summary: Multilayer perceptron regression
 Key: SPARK-10409
 URL: https://issues.apache.org/jira/browse/SPARK-10409
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.5.0
Reporter: Alexander Ulanov
Priority: Minor


Implement regression based on the multilayer perceptron (MLP). It should support different kinds of outputs: binary, real in [0;1), and real in [-inf; +inf]. The implementation might take advantage of the autoencoder. Time-series forecasting for financial data might be one of the use cases (see http://dl.acm.org/citation.cfm?id=561452), so there is a need for more specific requirements from this (or another) area.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (SPARK-10409) Multilayer perceptron regression

2015-09-01 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726435#comment-14726435
 ] 

Alexander Ulanov commented on SPARK-10409:
--

Basic implementation with the current ML api can be found here: 
https://github.com/avulanov/spark/blob/a2261330c227be8ef26172dbe355a617d653553a/mllib/src/main/scala/org/apache/spark/ml/regression/MultilayerPerceptronRegressor.scala

> Multilayer perceptron regression
> 
>
> Key: SPARK-10409
> URL: https://issues.apache.org/jira/browse/SPARK-10409
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Implement regression based on the multilayer perceptron (MLP). It should support different kinds of outputs: binary, real in [0;1), and real in [-inf; +inf]. The implementation might take advantage of the autoencoder. Time-series forecasting for financial data might be one of the use cases (see http://dl.acm.org/citation.cfm?id=561452), so there is a need for more specific requirements from this (or another) area.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Updated] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-10408:
-
Description: 
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf]
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here 
3)Denoising autoencoder 
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers: 

References: 
1-3. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf

  was:
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf]
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here
3)Denoising autoencoder
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers


> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf]
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers: 
> References: 
> 1-3. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
> 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Comment Edited] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726423#comment-14726423
 ] 

Alexander Ulanov edited comment on SPARK-10408 at 9/1/15 11:55 PM:
---

Added an implementation for (1), the basic deep autoencoder: https://github.com/avulanov/spark/tree/autoencoder-mlp (https://github.com/avulanov/spark/blob/autoencoder-mlp/mllib/src/main/scala/org/apache/spark/ml/feature/Autoencoder.scala)


was (Author: avulanov):
Added an implementation for (1), the basic deep autoencoder: https://github.com/avulanov/spark/tree/autoencoder-mlp (

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf]
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers: 
> References: 
> 1-3. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
> 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Comment Edited] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726423#comment-14726423
 ] 

Alexander Ulanov edited comment on SPARK-10408 at 9/1/15 11:55 PM:
---

Added an implementation for (1), the basic deep autoencoder: https://github.com/avulanov/spark/tree/autoencoder-mlp (


was (Author: avulanov):
Added an implementation for (1), the basic deep autoencoder: https://github.com/avulanov/spark/tree/autoencoder-mlp (https://github.com/avulanov/spark/blob/ann-auto-rbm-mlor/mllib/src/main/scala/org/apache/spark/mllib/ann/Autoencoder.scala)

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1]. real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers: 
> References: 
> 1-3. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
> 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Comment Edited] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726423#comment-14726423
 ] 

Alexander Ulanov edited comment on SPARK-10408 at 9/1/15 11:54 PM:
---

Added an implementation for (1), the basic deep autoencoder: https://github.com/avulanov/spark/tree/autoencoder-mlp (https://github.com/avulanov/spark/blob/ann-auto-rbm-mlor/mllib/src/main/scala/org/apache/spark/mllib/ann/Autoencoder.scala)


was (Author: avulanov):
Added an implementation for (1), the basic deep autoencoder: https://github.com/avulanov/spark/tree/autoencoder-mlp

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf]
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers: 
> References: 
> 1-3. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
> 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Updated] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-10408:
-
Issue Type: Umbrella  (was: Improvement)

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf]
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here
> 3)Denoising autoencoder
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Created] (SPARK-10408) Autoencoder

2015-09-01 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-10408:


 Summary: Autoencoder
 Key: SPARK-10408
 URL: https://issues.apache.org/jira/browse/SPARK-10408
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.5.0
Reporter: Alexander Ulanov
Priority: Minor


Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf]
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here
3)Denoising autoencoder
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers
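As a reference point for requirement (1), a self-contained breeze sketch of one gradient step for a tied-weight sigmoid autoencoder with squared-error loss (illustrative only):

{code:scala}
import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
import breeze.numerics.sigmoid

// One SGD step: h = s(W x), y = s(W' h), L = ||y - x||^2 / 2.
// w has one row per hidden unit; x is a single input with values in [0..1].
def autoencoderStep(w: BDM[Double], x: BDV[Double], lr: Double): BDM[Double] = {
  val h = sigmoid(w * x)   // encode
  val y = sigmoid(w.t * h) // decode with tied weights
  val deltaY = (y - x) :* y :* (BDV.ones[Double](y.length) - y)      // output delta
  val deltaH = (w * deltaY) :* h :* (BDV.ones[Double](h.length) - h) // hidden delta
  w - (deltaH * x.t + h * deltaY.t) * lr // tied-weight gradient step
}
{code}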



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap

2015-09-02 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-10324:
-
Description: 
Following SPARK-8445, we created this master list for MLlib features we plan to 
have in Spark 1.6. Please view this list as a wish list rather than a concrete 
plan, because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes a long delay in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on some features. This is to avoid duplicate work. For small 
features, you don't need to wait to get JIRA assigned.
* For medium/big features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Remember to add `@Since("1.6.0")` annotation to new public APIs.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps improve others' code as well as yours.

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add "starter" label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on 
JIRA.
* If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 
1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include 
umbrella JIRAs and high-level tasks.

h2. Algorithms and performance

* log-linear model for survival analysis (SPARK-8518)
* normal equation approach for linear regression (SPARK-9834)
* iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835)
* robust linear regression with Huber loss (SPARK-3181)
* vector-free L-BFGS (SPARK-10078)
* tree partition by features (SPARK-3717)
* bisecting k-means (SPARK-6517)
* weighted instance support (SPARK-9610)
** logistic regression (SPARK-7685)
** linear regression (SPARK-9642)
** random forest (SPARK-9478)
* locality sensitive hashing (LSH) (SPARK-5992)
* deep learning (SPARK-5575)
** autoencoder (SPARK-10408)
** restricted Boltzmann machine (RBM) (SPARK-4251)
** convolutional neural network (stretch)
* factorization machine (SPARK-7008)
* local linear algebra (SPARK-6442)
* distributed LU decomposition (SPARK-8514)

h2. Statistics

* univariate statistics as UDAFs (SPARK-10384)
* bivariate statistics as UDAFs (SPARK-10385)
* R-like statistics for GLMs (SPARK-9835)
* online hypothesis testing (SPARK-3147)

h2. Pipeline API

* pipeline persistence (SPARK-6725)
* ML attribute API improvements (SPARK-8515)
* feature transformers (SPARK-9930)
** feature interaction (SPARK-9698)
** SQL transformer (SPARK-8345)
** ??
* predict single instance (SPARK-10413)
* test Kaggle datasets (SPARK-9941)

h2. Model persistence

* PMML export
** naive Bayes (SPARK-8546)
** decision tree (SPARK-8542)
* model save/load
** FPGrowth (SPARK-6724)
** PrefixSpan (SPARK-10386)
* code generation
** decision tree and tree ensembles (SPARK-10387)

h2. Data sources

* LIBSVM data source (SPARK-10117)
* public dataset loader (SPARK-10388)

h2. Python API for ML

The main goal of Python API is to have feature parity with Scala/Java API. You 
can find a complete list 
[here|https://issues.apache.org/jira/issues/?filter=12333214]. The tasks fall 
into two major categories:

* Python API for new algorithms
* Python API for missing methods

h2. SparkR API for ML

* support more families and link functions in SparkR::glm (SPARK-9838, 
SPARK-9839, SPARK-9840)
* better R formula support (SPARK-9681)
* model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837)

h2. Documentation

* re-organize user guide (SPARK-8517)
* @Since versions in spark.ml, pyspark.mllib, and 

[jira] [Closed] (SPARK-4752) Classifier based on artificial neural network

2015-09-08 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov closed SPARK-4752.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

> Classifier based on artificial neural network
> -
>
> Key: SPARK-4752
> URL: https://issues.apache.org/jira/browse/SPARK-4752
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Alexander Ulanov
> Fix For: 1.5.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Implement classifier based on artificial neural network (ANN). Requirements:
> 1) Use the existing artificial neural network implementation 
> https://issues.apache.org/jira/browse/SPARK-2352, 
> https://github.com/apache/spark/pull/1290
> 2) Extend MLlib ClassificationModel trait, 
> 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training,
> 4) Be able to return the ANN model



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (SPARK-10627) Regularization for artificial neural networks

2015-09-15 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746636#comment-14746636
 ] 

Alexander Ulanov commented on SPARK-10627:
--

Dropout WIP refactoring for the new ML API 
https://github.com/avulanov/spark/tree/dropout-mlp. 

> Regularization for artificial neural networks
> -
>
> Key: SPARK-10627
> URL: https://issues.apache.org/jira/browse/SPARK-10627
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Add regularization for artificial neural networks. This includes, but is not limited to:
> 1)L1 and L2 regularization
> 2)Dropout http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
> 3)Dropconnect 
> http://machinelearning.wustl.edu/mlpapers/paper_files/icml2013_wan13.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Created] (SPARK-10627) Regularization for artificial neural networks

2015-09-15 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-10627:


 Summary: Regularization for artificial neural networks
 Key: SPARK-10627
 URL: https://issues.apache.org/jira/browse/SPARK-10627
 Project: Spark
  Issue Type: Umbrella
  Components: ML
Affects Versions: 1.5.0
Reporter: Alexander Ulanov
Priority: Minor


Add regularization for artificial neural networks. This includes, but is not limited to:
1)L1 and L2 regularization
2)Dropout http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
3)Dropconnect 
http://machinelearning.wustl.edu/mlpapers/paper_files/icml2013_wan13.pdf
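As a reference point for (2), a minimal breeze sketch of inverted dropout (illustrative, independent of the ANN code):

{code:scala}
import breeze.linalg.{DenseMatrix => BDM}
import scala.util.Random

// Inverted dropout: at training time, zero each activation with probability p and
// scale the survivors by 1 / (1 - p), so no rescaling is needed at test time.
def dropout(activations: BDM[Double], p: Double, rng: Random): BDM[Double] =
  activations.map(a => if (rng.nextDouble() < p) 0.0 else a / (1.0 - p))
{code}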



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2015-09-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737948#comment-14737948
 ] 

Alexander Ulanov commented on SPARK-9273:
-

Hi Yuhao! I have a few comments regarding the interface and the optimization of your implementation. There are two options for optimizing convolutions: using matrix-matrix multiplication and using FFTs. The latter seems a bit more complicated, since we don't have an optimized parallel FFT in Spark, and it would also have to support batch data processing. Instead, if one uses matrix-matrix multiplication for convolution, then it can take advantage of native BLAS, and batch computations can be supported in a straightforward way. Another benefit is that we would not need to change the current Layer's input/output type (matrix) to a tensor; we can store the unwrapped inputs/outputs as vectors within the input/output matrix. Do you think that it is reasonable?


> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Comment Edited] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2015-09-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737948#comment-14737948
 ] 

Alexander Ulanov edited comment on SPARK-9273 at 9/10/15 1:18 AM:
--

Hi Yuhao! I have a few comments regarding the interface and the optimization of 
your implementation. There are two options for optimizing convolutions: using 
matrix-matrix multiplication and using FFTs. The latter seems a bit more 
complicated, since we don't have an optimized parallel FFT in Spark, and it 
would also have to support batch data processing. If one instead uses 
matrix-matrix multiplication for convolution, it can take advantage of native 
BLAS, and batch computations are supported straightforwardly. Another benefit 
is that we would not need to change the current Layer's input/output type from 
matrix to tensor: we can store the unwrapped inputs/outputs as vectors within 
the input/output matrix. Does it make sense to you?



was (Author: avulanov):
Hi Yuhao! I have a few comments regarding the interface and the optimization of 
your implementation. There are two options for optimizing convolutions: using 
matrix-matrix multiplication and using FFTs. The latter seems a bit more 
complicated, since we don't have an optimized parallel FFT in Spark, and it 
would also have to support batch data processing. If one instead uses 
matrix-matrix multiplication for convolution, it can take advantage of native 
BLAS, and batch computations are supported straightforwardly. Another benefit 
is that we would not need to change the current Layer's input/output type from 
matrix to tensor: we can store the unwrapped inputs/outputs as vectors within 
the input/output matrix. Do you think that it is reasonable?


> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-10-01 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940446#comment-14940446
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Hi, Weide,

Sounds good! What kind of feature are you planning to add?

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation, etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed-forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted 
> Boltzmann machines (RBM), deep belief networks (DBN), etc.
> 4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
> poolers, etc.
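As a rough illustration of requirement 1, one possible shape for such traits (a 
sketch only, not the actual Spark ML code; all names and signatures are 
assumptions):

{code:scala}
// Illustrative sketch, not org.apache.spark.ml.ann.
object AnnSketch {
  // A batch is laid out as one example per column.
  type Batch = Array[Array[Double]]

  trait Layer {
    def forward(input: Batch): Batch
    // delta is the error arriving from the next layer; returns the
    // gradient w.r.t. this layer's weights and the delta to propagate
    // to the previous layer.
    def backward(input: Batch, delta: Batch): (Array[Double], Batch)
  }

  trait Error {
    def loss(output: Batch, target: Batch): Double
    def delta(output: Batch, target: Batch): Batch
  }

  trait Regularization {
    def penalty(weights: Array[Double]): Double
    def gradient(weights: Array[Double]): Array[Double]
  }
}
{code}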



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-10-05 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944173#comment-14944173
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Weide,

These are major features and some of them are under development. You can check 
their status in the linked issues. Could you work on something smaller as a 
first step? [~mengxr], do you have any suggestions?

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation, etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed-forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted 
> Boltzmann machines (RBM), deep belief networks (DBN), etc.
> 4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
> poolers, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15893) spark.createDataFrame raises an exception in Spark 2.0 tests on Windows

2016-06-10 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-15893:


 Summary: spark.createDataFrame raises an exception in Spark 2.0 
tests on Windows
 Key: SPARK-15893
 URL: https://issues.apache.org/jira/browse/SPARK-15893
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.0.0
Reporter: Alexander Ulanov


spark.createDataFrame raises an exception in Spark 2.0 tests on Windows

For example, LogisticRegressionSuite fails at Line 46:
Exception encountered when invoking run on a nested suite - 
java.net.URISyntaxException: Relative path in absolute URI: 
file:C:/dev/spark/external/flume-assembly/spark-warehouse
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: file:C:/dev/spark/external/flume-assembly/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:109)


As another example, DataFrameSuite raises:
java.net.URISyntaxException: Relative path in absolute URI: 
file:C:/dev/spark/external/flume-assembly/spark-warehouse
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: file:C:/dev/spark/external/flume-assembly/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
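For reference, a minimal sketch of a reproduction and a possible mitigation, 
assuming the derived default warehouse path is the culprit; 
spark.sql.warehouse.dir is the Spark 2.0 setting, but the path chosen here is 
only an example:

{code:scala}
import org.apache.spark.sql.SparkSession

// On Windows the derived default "file:C:/..." warehouse path is an
// invalid relative-path-in-absolute-URI, hence the exception above.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("SPARK-15893 repro")
  // Possible mitigation (assumption): set an explicit, well-formed URI.
  .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
  .getOrCreate()

// The failing call from the test suites:
val df = spark.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "value")
df.show()
{code}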





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-06-10 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325377#comment-15325377
 ] 

Alexander Ulanov commented on SPARK-15581:
--

I would like to comment on the Breeze and deep learning parts, because I have 
been implementing the multilayer perceptron for Spark and have used Breeze a 
lot.

Breeze provides convenient abstractions for dense and sparse vectors and 
matrices and allows performing linear algebra backed by netlib-java and native 
BLAS. At the same time, Spark "linalg" has its own abstractions for that. This 
might be confusing to users and developers; obviously, Spark should have a 
single library for linear algebra. Having said that, Breeze is more convenient 
and flexible than linalg, though it misses some features such as in-place 
matrix multiplication and multidimensional arrays. Breeze cannot be removed 
from Spark because "linalg" does not have enough functionality to fully replace 
it. To address this, I have implemented a Scala tensor library on top of 
netlib-java that "linalg" can be wrapped around. It also provides functions 
similar to Breeze and allows working with multi-dimensional arrays. [~mengxr], 
[~dbtsai] and I were planning to discuss this after the 2.0 release, and I am 
posting these considerations here since you raised this question too. Could you 
take a look at this library and tell me what you think? The source code is 
here: https://github.com/avulanov/scala-tensor
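For example, the kind of convenience in question (plain Breeze API; the mllib 
"linalg" side would need its own parallel types for the same operations):

{code:scala}
import breeze.linalg.{DenseMatrix, DenseVector, SparseVector}

// Operator syntax over netlib-java-backed BLAS.
val A = DenseMatrix((1.0, 2.0), (3.0, 4.0))
val x = DenseVector(0.5, -0.5)
val y: DenseVector[Double] = A * x           // dgemv under the hood
val s = SparseVector(4)(0 -> 1.0, 3 -> 2.0)  // sparse, same interface
{code}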

With regards to deep learning, I believe that having deep learning within 
Spark's ML library is a question of convenience. Spark has broad analytic 
capabilities, and it is useful to have deep learning as one of these tools at 
hand. Deep learning is the model of choice for several important modern 
use-cases, and Spark ML might want to cover them. After all, it is hard to 
explain why we have PCA in ML but do not provide an autoencoder. To summarize, 
I think that Spark should have at least the most widely used deep learning 
models, such as the fully connected artificial neural network, the 
convolutional network and the autoencoder. Advanced and experimental deep 
learning features might reside within packages or as pluggable external tools. 
Spark ML already has fully connected networks in place. A stacked autoencoder 
is implemented but not merged yet. The only thing that remains is the 
convolutional network. These three would provide a comprehensive deep learning 
set for Spark ML.
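For context, the existing fully connected network is exposed in Spark ML as 
MultilayerPerceptronClassifier (real API since 1.5; the layer sizes below are 
just an example):

{code:scala}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// 4 input features, one hidden layer of 5 units, 3 output classes.
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 3))
  .setBlockSize(128)  // stacks rows into matrices so BLAS can batch them
  .setMaxIter(100)
// val model = mlp.fit(train)  // train: DataFrame with "features"/"label"
{code}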

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes a long delay in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding, and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released to other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If 

[jira] [Created] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-09 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-15851:


 Summary: Spark 2.0 does not compile in Windows 7
 Key: SPARK-15851
 URL: https://issues.apache.org/jira/browse/SPARK-15851
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.0.0
 Environment: Windows 7
Reporter: Alexander Ulanov


Spark does not compile on Windows 7: "mvn compile" fails on spark-core because 
it tries to execute the bash script spark-build-info.

Workaround:
1) Install win-bash and put it on the PATH
2) Change line 350 of core/pom.xml so that the script is invoked through bash 
explicitly (the XML snippet was stripped from the archived message; a hedged 
sketch follows)
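A plausible reconstruction of the stripped snippet, assuming the change wraps 
the script in an explicit bash invocation (hypothetical; the exact arguments 
may differ):

{code:xml}
<!-- Hypothetical reconstruction: run the script through bash explicitly -->
<exec executable="bash">
  <arg value="${project.basedir}/../build/spark-build-info"/>
  <arg value="${project.build.directory}/extra-resources"/>
  <arg value="${project.version}"/>
</exec>
{code}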

Error trace:
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
spark-core_2.11: An Ant BuildException has occured: Execute failed: 
java.io.IOException: Cannot run program 
"C:\dev\spark\core\..\build\spark-build-info" (in directory 
"C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
application
[ERROR] around Ant part .. @ 4:73 in 
C:\dev\spark\core\target\antrun\build-main.xml




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


