[jira] [Created] (SPARK-2222) Add multiclass evaluation metrics
Alexander Ulanov created SPARK-2222:
---

 Summary: Add multiclass evaluation metrics
 Key: SPARK-2222
 URL: https://issues.apache.org/jira/browse/SPARK-2222
 Project: Spark
 Issue Type: New Feature
 Components: MLlib
 Affects Versions: 1.0.0
 Reporter: Alexander Ulanov

There is no class in Spark MLlib for measuring the performance of multiclass classifiers. This task involves adding such a class and unit tests. The following measures are to be implemented: per-class, micro-averaged and weighted-averaged Precision, Recall and F1-measure.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
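The measures requested above can be sketched in plain Scala over a sequence of (prediction, label) pairs. This is an illustrative computation of the definitions, not the MLlib implementation; the toy data and value names are made up:

```scala
// Toy (prediction, label) pairs for a 3-class problem.
val pairs = Seq((0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 1.0), (2.0, 2.0), (2.0, 0.0))
val classes = pairs.flatMap { case (p, l) => Seq(p, l) }.distinct.sorted

def tp(c: Double) = pairs.count { case (p, l) => p == c && l == c }
def fp(c: Double) = pairs.count { case (p, l) => p == c && l != c }
def fn(c: Double) = pairs.count { case (p, l) => p != c && l == c }

// Per-class precision, recall and F1.
val perClass = classes.map { c =>
  val precision = tp(c).toDouble / (tp(c) + fp(c))
  val recall    = tp(c).toDouble / (tp(c) + fn(c))
  val f1        = 2 * precision * recall / (precision + recall)
  c -> (precision, recall, f1)
}.toMap

// Micro-averaged precision pools all decisions; for single-label
// multiclass classification it coincides with accuracy.
val microPrecision = classes.map(tp).sum.toDouble / pairs.size

// Weighted-averaged recall weights each class by its share of true labels.
val weightedRecall = classes.map { c =>
  val weight = pairs.count(_._2 == c).toDouble / pairs.size
  weight * (tp(c).toDouble / (tp(c) + fn(c)))
}.sum
```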
[jira] [Created] (SPARK-2329) Add multi-label evaluation metrics
Alexander Ulanov created SPARK-2329:
---

 Summary: Add multi-label evaluation metrics
 Key: SPARK-2329
 URL: https://issues.apache.org/jira/browse/SPARK-2329
 Project: Spark
 Issue Type: New Feature
 Components: MLlib
 Affects Versions: 1.0.0
 Reporter: Alexander Ulanov
 Fix For: 1.1.0

There is no class in Spark MLlib for measuring the performance of multi-label classifiers. Multi-label classification is when a document is labeled with several labels (classes). This task involves adding a class for multi-label evaluation and unit tests. The following measures are to be implemented: Precision, Recall and F1-measure (1) based on documents, averaged by the number of documents; (2) per label; (3) based on labels, micro- and macro-averaged; (4) Hamming loss.

Reference: Tsoumakas, Grigorios, Ioannis Katakis, and Ioannis Vlahavas. "Mining multi-label data." Data Mining and Knowledge Discovery Handbook. Springer US, 2010. 667-685.
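The document-averaged measures and Hamming loss listed above can be illustrated in plain Scala over per-document label sets. A sketch of the definitions following Tsoumakas et al., with made-up toy data:

```scala
// Predicted and true label sets per document (toy data).
val predictions = Seq(Set(1, 2), Set(2), Set(1, 3))
val labels      = Seq(Set(1),    Set(2), Set(1, 3))
val numLabels   = (predictions ++ labels).reduce(_ ++ _).size

// Hamming loss: fraction of wrong (document, label) decisions,
// i.e. the size of the symmetric difference, normalized.
val hammingLoss = predictions.zip(labels).map { case (p, l) =>
  ((p diff l).size + (l diff p).size).toDouble
}.sum / (predictions.size * numLabels)

// Document-based precision and recall: per-document scores
// averaged over the number of documents.
val docPrecision = predictions.zip(labels).map { case (p, l) =>
  (p intersect l).size.toDouble / p.size
}.sum / predictions.size

val docRecall = predictions.zip(labels).map { case (p, l) =>
  (p intersect l).size.toDouble / l.size
}.sum / labels.size
```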
[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049939#comment-14049939 ] Alexander Ulanov commented on SPARK-1473:
---

Is anybody working on this issue?

Feature selection for high dimensional datasets
---

 Key: SPARK-1473
 URL: https://issues.apache.org/jira/browse/SPARK-1473
 Project: Spark
 Issue Type: New Feature
 Components: MLlib
 Reporter: Ignacio Zendejas
 Priority: Minor
 Labels: features
 Fix For: 1.1.0

For classification tasks involving large feature spaces in the order of tens of thousands or higher (e.g., text classification with n-grams, where n > 1), it is often useful to rank and filter out irrelevant features, thereby reducing the feature space by at least one or two orders of magnitude without impacting performance on key evaluation metrics (accuracy/precision/recall).

A flexible feature evaluation interface needs to be designed, and at least two methods should be implemented, with Information Gain being a priority as it has been shown to be amongst the most reliable. Special consideration should be taken in the design to account for wrapper methods (see research papers below), which are more practical for lower dimensional data.

Relevant research:
* Brown, G., Pocock, A., Zhao, M. J., Luján, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 13, 27-66.
* Forman, George. An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research 3 (2003): 1289-1305.
[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090473#comment-14090473 ] Alexander Ulanov commented on SPARK-1473:
---

I've implemented Chi-Squared and added a pull request.

---
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090473#comment-14090473 ] Alexander Ulanov edited comment on SPARK-1473 at 8/8/14 8:27 AM:
---

I've implemented Chi-Squared and added a pull request: https://github.com/apache/spark/pull/1484

was (Author: avulanov):
I've implemented Chi-Squared and added a pull request.
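The Chi-Squared criterion in the pull request scores each feature by the chi-squared statistic of its contingency table against the class. A minimal sketch of the statistic for one binary feature and a binary class (illustrative counts, not the PR's code):

```scala
// 2x2 contingency table: rows = feature present/absent,
// columns = class positive/negative (observed counts).
val observed = Array(Array(30.0, 10.0), Array(20.0, 40.0))
val total    = observed.flatten.sum
val rowSums  = observed.map(_.sum)
val colSums  = Array(observed.map(_(0)).sum, observed.map(_(1)).sum)

// chi^2 = sum over cells of (O - E)^2 / E with E = rowSum * colSum / total;
// a larger statistic means the feature is more informative about the class.
val chi2 = (for (i <- 0 until 2; j <- 0 until 2) yield {
  val expected = rowSums(i) * colSums(j) / total
  math.pow(observed(i)(j) - expected, 2) / expected
}).sum
```

Features would then be ranked by this statistic and the top ones kept.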
[jira] [Created] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
Alexander Ulanov created SPARK-3403:
---

 Summary: NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
 Issue Type: Bug
 Components: MLlib
 Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as described on https://github.com/fommil/netlib-java). I used OpenBLAS x64 and MinGW64 precompiled dlls.
 Reporter: Alexander Ulanov
 Fix For: 1.1.0

Code:

val model = NaiveBayes.train(train)
val predictionAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
predictionAndLabels.foreach(println)

Result: the program crashes with "Process finished with exit code -1073741819 (0xC0000005)" after displaying the first prediction.
[jira] [Updated] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-3403:
---

 Attachment: NativeNN.scala

The file contains an example that produces the same issue.
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121563#comment-14121563 ] Alexander Ulanov commented on SPARK-3403:
---

Yes, I tried using netlib-java separately with the same OpenBLAS setup and it worked properly, even within several threads. However, I didn't mimic the same multi-threading setup as MLlib has because it is complicated. Do you want me to send you all the DLLs that I used? I had trouble compiling OpenBLAS for Windows, so I used precompiled x64 versions from the OpenBLAS and MinGW64 websites.
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122699#comment-14122699 ] Alexander Ulanov commented on SPARK-3403:
---

I managed to compile OpenBLAS with MinGW64 and `USE_THREAD=0`, which gave me a single-threaded dll. With this dll my tests didn't fail and seem to execute properly. Thank you for the suggestion!
1) Do you think that the same issue will remain on Linux?
2) What are the performance implications of using single-threaded OpenBLAS through breeze?
[jira] [Comment Edited] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122699#comment-14122699 ] Alexander Ulanov edited comment on SPARK-3403 at 9/5/14 9:53 AM:
---

I managed to compile OpenBLAS with MinGW64 and `USE_THREAD=0`, which gave me a single-threaded dll. With this dll my tests didn't fail and seem to execute properly. Thank you for the suggestion!
1) Do you think that the same issue will remain on Linux?
2) What are the performance implications of using single-threaded OpenBLAS through breeze?
3) I didn't get any performance improvement with native libraries versus Java arrays. My matrices are of size up to 10K-20K. Is it supposed to be so?

was (Author: avulanov):
I managed to compile OpenBLAS with MinGW64 and `USE_THREAD=0`, which gave me a single-threaded dll. With this dll my tests didn't fail and seem to execute properly. Thank you for the suggestion!
1) Do you think that the same issue will remain on Linux?
2) What are the performance implications of using single-threaded OpenBLAS through breeze?
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138829#comment-14138829 ] Alexander Ulanov commented on SPARK-3403:
---

Thank you, your answers are really helpful. Should I submit this issue to OpenBLAS (https://github.com/xianyi/OpenBLAS) or netlib-java (https://github.com/fommil/netlib-java)? I thought the latter has the jni implementation. Is it ok to submit it as is?
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14140128#comment-14140128 ] Alexander Ulanov commented on SPARK-3403:
---

Posted to netlib-java: https://github.com/fommil/netlib-java/issues/72
[jira] [Comment Edited] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138829#comment-14138829 ] Alexander Ulanov edited comment on SPARK-3403 at 9/19/14 7:16 AM:
---

Thank you, your answers are really helpful. Should I submit this issue to OpenBLAS ( https://github.com/xianyi/OpenBLAS ) or netlib-java ( https://github.com/fommil/netlib-java )? I thought the latter has the jni implementation. Is it ok to submit it as is?

was (Author: avulanov):
Thank you, your answers are really helpful. Should I submit this issue to OpenBLAS (https://github.com/xianyi/OpenBLAS) or netlib-java (https://github.com/fommil/netlib-java)? I thought the latter has the jni implementation. Is it ok to submit it as is?
[jira] [Created] (SPARK-4752) Classifier based on artificial neural network
Alexander Ulanov created SPARK-4752:
---

 Summary: Classifier based on artificial neural network
 Key: SPARK-4752
 URL: https://issues.apache.org/jira/browse/SPARK-4752
 Project: Spark
 Issue Type: New Feature
 Components: MLlib
 Affects Versions: 1.1.0
 Reporter: Alexander Ulanov
 Fix For: 1.3.0

Implement a classifier based on an artificial neural network (ANN). Requirements:
1) Use the existing artificial neural network implementation: https://issues.apache.org/jira/browse/SPARK-2352, https://github.com/apache/spark/pull/1290
2) Extend the MLlib ClassificationModel trait
3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training
4) Be able to return the ANN model
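One common way to meet these requirements, and the approach described in the comments on this issue, is to encode the class label as a binary (one-hot) vector on the ANN output layer and to decode a prediction by taking the index of the largest output. A plain-Scala sketch of just the encoding step (helper names are hypothetical):

```scala
// Encode a class label as a one-hot vector of length numClasses,
// to be used as the ANN's training target.
def encode(label: Int, numClasses: Int): Array[Double] = {
  val v = Array.fill(numClasses)(0.0)
  v(label) = 1.0
  v
}

// Decode a prediction: the class is the index of the largest output unit.
def decode(output: Array[Double]): Int = output.indexOf(output.max)

val target    = encode(2, 4)                         // one-hot target for class 2 of 4
val predicted = decode(Array(0.1, 0.7, 0.15, 0.05))  // class 1
```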
[jira] [Comment Edited] (SPARK-4752) Classifier based on artificial neural network
[ https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234855#comment-14234855 ] Alexander Ulanov edited comment on SPARK-4752 at 12/5/14 12:51 AM:
---

The initial implementation can be found here: https://github.com/avulanov/spark/tree/annclassifier. It encodes the class label as a binary vector in the ANN output and selects the class based on the biggest output value. The implementation contains unit tests as well. The mentioned code uses the following PR: https://github.com/apache/spark/pull/1290. It is not yet merged into the main branch. I think that I should not make a pull request until then.

was (Author: avulanov):
The initial implementation can be found here: https://github.com/avulanov/spark/tree/annclassifier. It codes the class label as a binary vector in the ANN output and selects the class based on the biggest output value. The implementation contains unit tests as well. The mentioned code uses the following PR: https://github.com/apache/spark/pull/1290. It is not yet merged into the main branch. I think that I should not make a pull request until then.
[jira] [Commented] (SPARK-4752) Classifier based on artificial neural network
[ https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234855#comment-14234855 ] Alexander Ulanov commented on SPARK-4752:
---

The initial implementation can be found here: https://github.com/avulanov/spark/tree/annclassifier. It encodes the class label as a binary vector in the ANN output and selects the class based on the biggest output value. The implementation contains unit tests as well. The mentioned code uses the following PR: https://github.com/apache/spark/pull/1290. It is not yet merged into the main branch. I think that I should not make a pull request until then.
[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14289677#comment-14289677 ] Alexander Ulanov commented on SPARK-5386:
---

My spark-env.sh contains:
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_INSTANCES=2
I run spark-shell with ./spark-shell --executor-memory 8G --driver-memory 8G. In the Spark UI each worker has 8GB of memory. Btw., I ran this code once again, and this time it does not crash but keeps trying to schedule the job on the failing node, which tries to allocate memory, fails, and so on. Is this normal behavior?

Reduce fails with vectors of big length
---

 Key: SPARK-5386
 URL: https://issues.apache.org/jira/browse/SPARK-5386
 Project: Spark
 Issue Type: Bug
 Components: Spark Core
 Affects Versions: 1.2.0
 Environment: Overall: 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers. Spark: ./spark-shell --executor-memory 8G --driver-memory 8G, spark.driver.maxResultSize 0, java.io.tmpdir and spark.local.dir set to a disk with a lot of free space
 Reporter: Alexander Ulanov
 Fix For: 1.3.0

Code:

import org.apache.spark.mllib.rdd.RDDFunctions._
import breeze.linalg._
import org.apache.log4j._
Logger.getRootLogger.setLevel(Level.OFF)
val n = 6000
val p = 12
val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n))
vv.reduce(_ + _)

When executing in the shell it crashes after some period of time. One of the nodes contains the following in stdout:

Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 2863661056 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /datac/spark/app-20150123091936-/89/hs_err_pid2247.log

During the execution there is a message: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: Connection from server-12.net/10.10.10.10:54701 closed
[jira] [Updated] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-5386:
---

 Description:

Code:

import org.apache.spark.mllib.rdd.RDDFunctions._
import breeze.linalg._
import org.apache.log4j._
Logger.getRootLogger.setLevel(Level.OFF)
val n = 6000
val p = 12
val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n))
vv.reduce(_ + _)

When executing in the shell it crashes after some period of time. One of the nodes contains the following in stdout:

Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 2863661056 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /datac/spark/app-20150123091936-/89/hs_err_pid2247.log

During the execution there is a message: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: Connection from server-12.net/10.10.10.10:54701 closed

 was: the same description with minor formatting differences in the code snippet.
[jira] [Created] (SPARK-5386) Reduce fails with vectors of big length
Alexander Ulanov created SPARK-5386:
---

 Summary: Reduce fails with vectors of big length
 Key: SPARK-5386
 URL: https://issues.apache.org/jira/browse/SPARK-5386
 Project: Spark
 Issue Type: Bug
 Components: Spark Core
 Affects Versions: 1.2.0
 Environment: 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers. ./spark-shell --executor-memory 8G --driver-memory 8G
 Reporter: Alexander Ulanov
 Fix For: 1.3.0

Code:

import org.apache.spark.mllib.rdd.RDDFunctions._
import breeze.linalg._
import org.apache.log4j._
Logger.getRootLogger.setLevel(Level.OFF)
val n = 6000
val p = 12
val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n))
vv.reduce(_ + _)

When executing in the shell it crashes after some period of time. One of the nodes contains the following in stdout:

Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 2863661056 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /datac/spark/app-20150123091936-/89/hs_err_pid2247.log

During the execution there is a message: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: Connection from server-12.net/10.10.10.10:54701 closed
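The "Cannot allocate memory" failure is consistent with the sheer size of the intermediate results: a dense vector of n doubles takes roughly 8*n bytes, and reduce materializes whole vectors on single JVMs as partial sums are combined. A back-of-the-envelope estimate (the vector length below is hypothetical, chosen only to show the scale):

```scala
// Approximate heap needed to hold k dense Double vectors of length n at once.
def bytesFor(n: Long, k: Long): Long = 8L * n * k

// A hypothetical 60-million-element vector occupies ~480 MB per copy,
// so combining even a handful of partition results at once can exhaust
// an 8 GB executor or driver; note the failed malloc above was ~2.8 GB.
val oneCopy   = bytesFor(60000000L, 1)
val sixCopies = bytesFor(60000000L, 6)
```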
[jira] [Updated] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-5386: Environment: Overall: 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers Spark: ./spark-shell --executor-memory 8G --driver-memory 8G spark.driver.maxResultSize 0 java.io.tmpdir and spark.local.dir set to a disk with a lot of free space was: 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers ./spark-shell --executor-memory 8G --driver-memory 8G
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289621#comment-14289621 ] Alexander Ulanov commented on SPARK-5386: - I allocate 8G for the driver and each worker. Could you suggest why it is not enough for handling a reduce operation with a 60M vector of Double?
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289708#comment-14289708 ] Alexander Ulanov commented on SPARK-5386: - Thank you for suggestions. 1. count() does work, it returns 12 2. It failed with p = 2. However, in some of my previous experiments it did not fail even for p up to 5 or 7 (in different runs)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289708#comment-14289708 ] Alexander Ulanov edited comment on SPARK-5386 at 1/23/15 6:52 PM: -- Thank you for suggestions. 1. count() does work, it returns 12 2. Full script failed with p = 2. However, in some of my previous experiments it did not fail even for p up to 5 or 7 (in different runs) was (Author: avulanov): Thank you for suggestions. 1. count() does work, it returns 12 2. It failed with p = 2. However, in some of my previous experiments it did not fail even for p up to 5 or 7 (in different runs)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-5386: Description: Code: import org.apache.spark.mllib.rdd.RDDFunctions._ import breeze.linalg._ import org.apache.log4j._ Logger.getRootLogger.setLevel(Level.OFF) val n = 60000000 val p = 12 val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n)) vv.count() vv.reduce(_ + _) When executed in the shell it crashes after some period of time. One of the nodes contains the following in stdout: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot allocate memory' (errno=12) # # There is insufficient memory for the Java Runtime Environment to continue. # Native memory allocation (malloc) failed to allocate 2863661056 bytes for committing reserved memory. # An error report file with more information is saved as: # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log During the execution there is a message: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: Connection from server-12.net/10.10.10.10:54701 closed was: Code: import org.apache.spark.mllib.rdd.RDDFunctions._ import breeze.linalg._ import org.apache.log4j._ Logger.getRootLogger.setLevel(Level.OFF) val n = 60000000 val p = 12 val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n)) vv.reduce(_ + _) When executed in the shell it crashes after some period of time. One of the nodes contains the following in stdout: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot allocate memory' (errno=12) # # There is insufficient memory for the Java Runtime Environment to continue. # Native memory allocation (malloc) failed to allocate 2863661056 bytes for committing reserved memory. 
# An error report file with more information is saved as: # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log During the execution there is a message: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: Connection from server-12.net/10.10.10.10:54701 closed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289880#comment-14289880 ] Alexander Ulanov commented on SPARK-5386: - Thank you, it might be the problem. I was trying to run GC before each operation but it did not help. Probably it takes a lot of memory to run the initialization of a Breeze DenseVector. Assuming that the problem is due to insufficient memory on the Worker node, I am curious what will happen on the Driver. Will it receive 12 vectors of 60M Doubles each and then do the aggregation? Is that feasible? (P.S. I know that there is a treeReduce function that forces partial aggregation on the Workers. However, for a big number of Workers the problem will remain in treeReduce as well, as far as I understand)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
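The treeReduce idea mentioned above can be illustrated locally: combine partial results pairwise over several rounds, so that no single process has to hold all p partial vectors at once. A pure-Scala sketch of the combining pattern only, not Spark code; `TreeCombine` is an illustrative name:

```scala
import scala.annotation.tailrec

object TreeCombine {
  // Pairwise (tree-shaped) combining: each round halves the number of
  // partial results, so a combiner only ever holds two of them at a time.
  // Assumes at least one partial result.
  @tailrec
  def treeCombine[A](partials: Vector[A])(op: (A, A) => A): A =
    partials match {
      case Vector(result) => result
      case _ =>
        val nextRound = partials
          .grouped(2)
          .map {
            case Vector(a, b) => op(a, b) // combine one pair
            case Vector(a)    => a        // odd one out moves up a round
          }
          .toVector
        treeCombine(nextRound)(op)
    }
}
```

This is why treeReduce shifts memory pressure from the driver back onto the workers: the driver only merges the final pair instead of all p partial vectors. As the comment notes, each worker still materializes its own partial vector, so the per-worker footprint is unchanged.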
[jira] [Created] (SPARK-5575) Artificial neural networks for MLlib deep learning
Alexander Ulanov created SPARK-5575: --- Summary: Artificial neural networks for MLlib deep learning Key: SPARK-5575 URL: https://issues.apache.org/jira/browse/SPARK-5575 Project: Spark Issue Type: Umbrella Components: MLlib Affects Versions: 1.2.0 Reporter: Alexander Ulanov Goal: Implement various types of artificial neural networks Motivation: deep learning trend Requirements: 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so they can be easily extended or reused 2) Implement complex abstractions, such as feed-forward and recurrent networks 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), autoencoder (sparse and denoising), stacked autoencoder, restricted Boltzmann machines (RBM), deep belief networks (DBN) etc. 4) Implement or reuse supporting constructs, such as classifiers, normalizers, poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
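A minimal sketch of what requirement 1 could look like. The trait and class names are hypothetical, not an agreed API, and `Array[Double]` stands in for whatever vector type the real implementation would use: a layer exposes forward propagation of activations and backward propagation of deltas, and a network is just a composition of layers.

```scala
// Hypothetical shape of the basic abstractions from requirement 1.
trait Layer {
  def forward(input: Array[Double]): Array[Double]
  def backward(input: Array[Double], outputDelta: Array[Double]): Array[Double]
}

// A tiny example layer: elementwise scaling by a fixed weight.
class ScaleLayer(w: Double) extends Layer {
  def forward(input: Array[Double]): Array[Double] =
    input.map(_ * w)
  // For an elementwise scale, d(loss)/d(input) = w * d(loss)/d(output).
  def backward(input: Array[Double], outputDelta: Array[Double]): Array[Double] =
    outputDelta.map(_ * w)
}

// A feed-forward network applies its layers in order.
class FeedForward(layers: Seq[Layer]) {
  def forward(input: Array[Double]): Array[Double] =
    layers.foldLeft(input)((x, layer) => layer.forward(x))
}
```

Concrete layers (affine, sigmoid, convolutional) would then plug into the same two methods, which is what makes the trait-based design of requirement 1 extensible.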
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277986#comment-14277986 ] Alexander Ulanov commented on SPARK-5256: - I would like to improve the Gradient interface so that it can process something more general than `Label` (which is relevant only to classifiers, not to other machine learning methods) and also allow batch processing. The simplest way of doing this is to add another function to the `Gradient` interface: def compute(data: Vector, output: Vector, weights: Vector, cumGradient: Vector): Double In the `Gradient` trait it should call `compute` with `label`. Of course, one needs to make some adjustments to the LBFGS and GradientDescent optimizers, replacing label: Double with output: Vector. For batch processing one can put data and output points stacked into a long vector (matrices are stored in this way in breeze) and pass them with the proposed interface. Improving MLlib optimization APIs - Key: SPARK-5256 URL: https://issues.apache.org/jira/browse/SPARK-5256 Project: Spark Issue Type: Umbrella Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley *Goal*: Improve APIs for optimization *Motivation*: There have been several disjoint mentions of improving the optimization APIs to make them more pluggable, extensible, etc. This JIRA is a place to discuss what API changes are necessary for the long term, and to provide links to other relevant JIRAs. Eventually, I hope this leads to a design doc outlining: * current issues * requirements such as supporting many types of objective functions, optimization algorithms, and parameters to those algorithms * ideal API * breakdown of smaller JIRAs needed to achieve that API I will soon create an initial design doc, and I will try to watch this JIRA and include ideas from JIRA comments. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
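The proposal above can be sketched as follows. This is illustrative only: `Array[Double]` stands in for MLlib's Vector to keep the sketch self-contained, and `VectorGradient` and the squared-error example are hypothetical names, not the actual MLlib API.

```scala
// Sketch of the proposed extension: a gradient that accepts a generic
// output vector rather than a scalar label. Names are illustrative.
trait VectorGradient {
  // Proposed general form: output is a vector, so targets beyond a
  // single class label (e.g. ANN outputs) fit the same interface.
  def compute(data: Array[Double], output: Array[Double],
              weights: Array[Double], cumGradient: Array[Double]): Double

  // Existing label-based form, expressed via the general one,
  // mirroring "in the Gradient trait it should call compute with label".
  def compute(data: Array[Double], label: Double,
              weights: Array[Double], cumGradient: Array[Double]): Double =
    compute(data, Array(label), weights, cumGradient)
}

// Example: squared error of a linear model, accumulating into cumGradient.
object SquaredErrorGradient extends VectorGradient {
  def compute(data: Array[Double], output: Array[Double],
              weights: Array[Double], cumGradient: Array[Double]): Double = {
    val prediction = data.zip(weights).map { case (x, w) => x * w }.sum
    val error = prediction - output(0)
    for (i <- data.indices) cumGradient(i) += error * data(i)
    0.5 * error * error
  }
}
```

Existing label-based callers keep working through the overload, while new methods can pass full output vectors.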
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277988#comment-14277988 ] Alexander Ulanov commented on SPARK-5256: - Also, asynchronous gradient update might be a good thing to have. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches
Alexander Ulanov created SPARK-5362: --- Summary: Gradient and Optimizer to support generic output (instead of label) and data batches Key: SPARK-5362 URL: https://issues.apache.org/jira/browse/SPARK-5362 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Alexander Ulanov Fix For: 1.3.0 Currently, the Gradient and Optimizer interfaces support data in the form of RDD[(Double, Vector)], which refers to label and features. This limits their application to classification problems. For example, an artificial neural network demands a Vector as output (instead of label: Double). Moreover, the current interface does not support data batches. I propose to replace label: Double with output: Vector. This enables passing a generic output instead of a label, and also passing data and output batches stored in the corresponding vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
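The batch idea — several data points stacked into one long vector — can be sketched with a plain column-major packing, the layout breeze's DenseMatrix uses. `BatchPacking` and its helpers are illustrative names, not proposed API:

```scala
// Sketch: pack k points of dimension d into one flat column-major array,
// point j occupying column j, so a whole batch can travel through a
// (data: Vector, output: Vector) interface. Illustrative code only.
object BatchPacking {
  def packBatch(points: Seq[Array[Double]]): Array[Double] = {
    val d = points.head.length
    require(points.forall(_.length == d), "all points must share a dimension")
    val flat = new Array[Double](d * points.length)
    for ((point, j) <- points.zipWithIndex; i <- 0 until d)
      flat(j * d + i) = point(i) // element i of column j
    flat
  }

  // Recover point j from the flat representation.
  def column(flat: Array[Double], d: Int, j: Int): Array[Double] =
    flat.slice(j * d, (j + 1) * d)
}
```

Because the layout matches breeze, the receiving side can reinterpret the flat array as a d-by-k matrix without copying.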
[jira] [Commented] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches
[ https://issues.apache.org/jira/browse/SPARK-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286703#comment-14286703 ] Alexander Ulanov commented on SPARK-5362: - https://github.com/apache/spark/pull/4152 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286706#comment-14286706 ] Alexander Ulanov commented on SPARK-5256: - I've implemented my proposition with Vector as output in https://issues.apache.org/jira/browse/SPARK-5362 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5912) Programming guide for feature selection
[ https://issues.apache.org/jira/browse/SPARK-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328246#comment-14328246 ] Alexander Ulanov commented on SPARK-5912: - Sure, I can. Could you point me to some template or a good example of a programming guide? Programming guide for feature selection --- Key: SPARK-5912 URL: https://issues.apache.org/jira/browse/SPARK-5912 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley The new ChiSqSelector for feature selection should have a section in the Programming Guide. It should probably be under the feature extraction and transformation section as a new subsection for feature selection. If we get more feature selection methods later on, we could expand it to a larger section of the guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7316) Add step capability to RDD sliding window
[ https://issues.apache.org/jira/browse/SPARK-7316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-7316: Description: RDDFunctions in MLlib contains a sliding window implementation with step 1. The user should be able to define the step; this capability should be implemented. Although one can generate sliding windows with step 1 and then filter every Nth window, that may take much more time and disk space depending on the step size. For example, if your window is 1000, you will generate an amount of data a thousand times bigger than your initial dataset. This does not make sense if you need just every Nth window; with a built-in step the generated data would be N times smaller (a factor of 1000/N instead of 1000). was:RDDFunctions in MLlib contains sliding window implementation with step 1. User should be able to define step. This capability should be implemented. Add step capability to RDD sliding window - Key: SPARK-7316 URL: https://issues.apache.org/jira/browse/SPARK-7316 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Original Estimate: 24h Remaining Estimate: 24h RDDFunctions in MLlib contains a sliding window implementation with step 1. The user should be able to define the step; this capability should be implemented. Although one can generate sliding windows with step 1 and then filter every Nth window, that may take much more time and disk space depending on the step size. For example, if your window is 1000, you will generate an amount of data a thousand times bigger than your initial dataset. This does not make sense if you need just every Nth window; with a built-in step the generated data would be N times smaller (a factor of 1000/N instead of 1000). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
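The requested semantics already exist on local Scala collections, whose sliding takes a step parameter; an RDD version would mirror them. A local illustration of step-N windows versus the generate-then-filter workaround described above:

```scala
// Step-N sliding windows on a local collection. Scala's own
// sliding(size, step) has the semantics proposed for the RDD version.
val direct = (1 to 10).sliding(2, 4).toList

// Equivalent generate-then-filter workaround: build every window with
// step 1, then keep only every 4th one.
val filtered = (1 to 10).sliding(2)
  .zipWithIndex
  .collect { case (window, i) if i % 4 == 0 => window }
  .toList

// Both yield the windows starting at 1, 5 and 9, but the second builds
// all nine intermediate windows first - the cost the issue describes.
```

On an RDD the intermediate windows are materialized data, not a lazy iterator, which is why the blow-up factor matters there.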
[jira] [Commented] (SPARK-7316) Add step capability to RDD sliding window
[ https://issues.apache.org/jira/browse/SPARK-7316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531229#comment-14531229 ] Alexander Ulanov commented on SPARK-7316: - I would say that the major use case is practical considerations :) In my case it is time series analysis of sensor data. It does not make sense to analyze time windows with step 1 because it is a high-frequency sensor (1024 Hz). Also, even if we wanted to, the size of the resulting data gets enormous. For example, I have 2B data points (542 hours) of size 23GB binary data. If I apply a sliding window with size 1024 and step 1, it will result in 1024*23=23.5TB of data, which I am not able to process with Spark currently (honestly speaking, my disk space is only 10TB). If you store the data in HDFS then it will be tripled, i.e. 70TB. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538209#comment-14538209 ] Alexander Ulanov commented on SPARK-5575: - Current implementation: https://github.com/avulanov/spark/tree/ann-interface-gemm -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494518#comment-14494518 ] Alexander Ulanov commented on SPARK-5256: - Probably the main issue for MLlib is that iterative algorithms are implemented with the aggregate function. It has a fixed overhead of around half a second, which limits its applicability when one needs to make a big number of iterations. This is the case for the bigger data that Spark is intended for. The problem gets worse with stochastic algorithms because there is no good way to randomly pick data from an RDD, so one needs to look through it sequentially. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14494568#comment-14494568 ] Alexander Ulanov edited comment on SPARK-5256 at 4/14/15 6:43 PM: -- The size of data that calls for using Spark suggests that the learning algorithm will be limited by time rather than by data. According to the paper "The tradeoffs of large scale learning", SGD has significantly faster convergence than batch GD in this case. My use case is machine learning on large data, in particular time series. Just in case, here is a link to the paper: http://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning.pdf was (Author: avulanov): The size of data that requires to use Spark suggests that learning algorithm will be limited by time versus data. According to the paper The tradeoffs of large scale learning, SGD has significantly faster convergence than batch GD in this case. My use case is machine learning on large data, in particular, time series. Improving MLlib optimization APIs - Key: SPARK-5256 URL: https://issues.apache.org/jira/browse/SPARK-5256 Project: Spark Issue Type: Umbrella Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley *Goal*: Improve APIs for optimization *Motivation*: There have been several disjoint mentions of improving the optimization APIs to make them more pluggable, extensible, etc. This JIRA is a place to discuss what API changes are necessary for the long term, and to provide links to other relevant JIRAs. Eventually, I hope this leads to a design doc outlining: * current issues * requirements such as supporting many types of objective functions, optimization algorithms, and parameters to those algorithms * ideal API * breakdown of smaller JIRAs needed to achieve that API I will soon create an initial design doc, and I will try to watch this JIRA and include ideas from JIRA comments. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14494568#comment-14494568 ] Alexander Ulanov commented on SPARK-5256: - The size of data that calls for using Spark suggests that the learning algorithm will be limited by time rather than by data. According to the paper "The tradeoffs of large scale learning", SGD has significantly faster convergence than batch GD in this case. My use case is machine learning on large data, in particular time series. Improving MLlib optimization APIs - Key: SPARK-5256 URL: https://issues.apache.org/jira/browse/SPARK-5256 Project: Spark Issue Type: Umbrella Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley *Goal*: Improve APIs for optimization *Motivation*: There have been several disjoint mentions of improving the optimization APIs to make them more pluggable, extensible, etc. This JIRA is a place to discuss what API changes are necessary for the long term, and to provide links to other relevant JIRAs. Eventually, I hope this leads to a design doc outlining: * current issues * requirements such as supporting many types of objective functions, optimization algorithms, and parameters to those algorithms * ideal API * breakdown of smaller JIRAs needed to achieve that API I will soon create an initial design doc, and I will try to watch this JIRA and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14494579#comment-14494579 ] Alexander Ulanov commented on SPARK-5256: - [~shivaram] Indeed, performance is orthogonal to the API design. Though well-designed things should work efficiently, don't you think? :) Improving MLlib optimization APIs - Key: SPARK-5256 URL: https://issues.apache.org/jira/browse/SPARK-5256 Project: Spark Issue Type: Umbrella Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley *Goal*: Improve APIs for optimization *Motivation*: There have been several disjoint mentions of improving the optimization APIs to make them more pluggable, extensible, etc. This JIRA is a place to discuss what API changes are necessary for the long term, and to provide links to other relevant JIRAs. Eventually, I hope this leads to a design doc outlining: * current issues * requirements such as supporting many types of objective functions, optimization algorithms, and parameters to those algorithms * ideal API * breakdown of smaller JIRAs needed to achieve that API I will soon create an initial design doc, and I will try to watch this JIRA and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14494579#comment-14494579 ] Alexander Ulanov edited comment on SPARK-5256 at 4/14/15 6:48 PM: -- [~shivaram] Indeed, performance is orthogonal to the API design. Though well-designed things should work efficiently, shouldn't they? :) was (Author: avulanov): [~shivaram] Indeed, performance is orthogonal to the API design. Though well-designed things should work efficient, don't you think? :) Improving MLlib optimization APIs - Key: SPARK-5256 URL: https://issues.apache.org/jira/browse/SPARK-5256 Project: Spark Issue Type: Umbrella Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley *Goal*: Improve APIs for optimization *Motivation*: There have been several disjoint mentions of improving the optimization APIs to make them more pluggable, extensible, etc. This JIRA is a place to discuss what API changes are necessary for the long term, and to provide links to other relevant JIRAs. Eventually, I hope this leads to a design doc outlining: * current issues * requirements such as supporting many types of objective functions, optimization algorithms, and parameters to those algorithms * ideal API * breakdown of smaller JIRAs needed to achieve that API I will soon create an initial design doc, and I will try to watch this JIRA and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6673) spark-shell.cmd can't start even when spark was built in Windows
[ https://issues.apache.org/jira/browse/SPARK-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395001#comment-14395001 ] Alexander Ulanov commented on SPARK-6673: - Probably a similar issue: I am trying to execute unit tests in MLlib with LocalClusterSparkContext on Windows 7. I am getting a bunch of errors in the log saying: Cannot find any assembly build directories. If I do set SPARK_SCALA_VERSION=2.10, then I get: No assemblies found in 'C:\dev\spark\mllib\.\assembly\target\scala-2.10' spark-shell.cmd can't start even when spark was built in Windows Key: SPARK-6673 URL: https://issues.apache.org/jira/browse/SPARK-6673 Project: Spark Issue Type: Bug Components: Windows Affects Versions: 1.3.0 Reporter: Masayoshi TSUZUKI Assignee: Masayoshi TSUZUKI Priority: Blocker spark-shell.cmd can't start. {code} bin\spark-shell.cmd --master local {code} will get {code} Failed to find Spark assembly JAR. You need to build Spark before running this program. {code} even when we have built Spark. This is because of the lack of the environment variable {{SPARK_SCALA_VERSION}}, which is used in {{spark-class2.cmd}}. In the Linux scripts, this value is set to {{2.10}} or {{2.11}} by default in {{load-spark-env.sh}}, but there is no equivalent script on Windows. As a workaround, by executing {code} set SPARK_SCALA_VERSION=2.10 {code} before executing spark-shell.cmd, we can successfully start it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop
[ https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395185#comment-14395185 ] Alexander Ulanov commented on SPARK-2356: - The following worked for me: download http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe, put it into DISK:\FOLDERS\bin\, and set HADOOP_CONF=DISK:\FOLDERS Exception: Could not locate executable null\bin\winutils.exe in the Hadoop --- Key: SPARK-2356 URL: https://issues.apache.org/jira/browse/SPARK-2356 Project: Spark Issue Type: Bug Components: Windows Affects Versions: 1.0.0 Reporter: Kostiantyn Kudriavtsev Priority: Critical I'm trying to run some transformations on Spark. They work fine on a cluster (YARN, Linux machines). However, when I try to run them on a local machine (Windows 7) under a unit test, I get errors (I don't use Hadoop; I'm reading a file from the local filesystem): {code} 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. 
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318) at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333) at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326) at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76) at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93) at org.apache.hadoop.security.Groups.<init>(Groups.java:77) at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255) at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283) at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36) at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109) at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala) at org.apache.spark.SparkContext.<init>(SparkContext.scala:228) at org.apache.spark.SparkContext.<init>(SparkContext.scala:97) {code} This happens because the Hadoop config is initialized each time a Spark context is created, regardless of whether Hadoop is required or not. I propose adding a special flag to indicate whether the Hadoop config is required (or starting this configuration manually) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483729#comment-14483729 ] Alexander Ulanov edited comment on SPARK-6682 at 4/7/15 6:35 PM: - This is a very good idea. Please note, though, that there are a few issues here. 1) Setting the optimizer: the optimizers (LBFGS and SGD) take Gradient and Updater as constructor parameters. I don't think it is a good idea to force users to create a Gradient and an Updater separately just to be able to create an Optimizer. So one has to explicitly implement methods like setLBFGSOptimizer or setSGDOptimizer and have them return the optimizer so that the user can set its parameters.

```
def LBFGSOptimizer: LBFGS = {
  val lbfgs = new LBFGS(_gradient, _updater)
  optimizer = lbfgs
  lbfgs
}
```

Another downside is that if someone implements a new Optimizer, one has to add a corresponding setMyOptimizer to the builder. The above problems might be solved by figuring out a better interface for Optimizer that allows setting its parameters without actually creating it. 2) Setting parameters after setting the optimizer: what if the user sets the Updater after setting the Optimizer? The Optimizer takes the Updater as a constructor parameter! So one has to recreate the corresponding Optimizer.

```
private[this] def updateGradient(gradient: Gradient): Unit = {
  optimizer match {
    case lbfgs: LBFGS => lbfgs.setGradient(gradient)
    case sgd: GradientDescent => sgd.setGradient(gradient)
    case other => throw new UnsupportedOperationException(
      s"Only LBFGS and GradientDescent are supported but got ${other.getClass}.")
  }
}
```

So it is essential to work out the Optimizer interface first.

was (Author: avulanov): This is a very good idea. Please note, though, that there are a few issues here. 1) Setting the optimizer: the optimizers (LBFGS and SGD) take Gradient and Updater as constructor parameters. I don't think it is a good idea to force users to create a Gradient and an Updater separately just to be able to create an Optimizer. So one has to explicitly implement methods like setLBFGSOptimizer or setSGDOptimizer and have them return the optimizer so that the user can set its parameters.

```
def LBFGSOptimizer: LBFGS = {
  val lbfgs = new LBFGS(_gradient, _updater)
  optimizer = lbfgs
  lbfgs
}
```

Another downside is that if someone implements a new Optimizer, one has to add a corresponding setMyOptimizer to the builder. The above problems might be solved by figuring out a better interface for Optimizer that allows setting its parameters without actually creating it. 2) Setting parameters after setting the optimizer: what if the user sets the Updater after setting the Optimizer? The Optimizer takes the Updater as a constructor parameter! So one has to recreate the corresponding Optimizer.

```
private[this] def updateGradient(gradient: Gradient): Unit = {
  optimizer match {
    case lbfgs: LBFGS => lbfgs.setGradient(gradient)
    case sgd: GradientDescent => sgd.setGradient(gradient)
    case other => throw new UnsupportedOperationException(
      s"Only LBFGS and GradientDescent are supported but got ${other.getClass}.")
  }
}
```

So it is essential to work out the Optimizer interface first.

Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). 
* Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483729#comment-14483729 ] Alexander Ulanov commented on SPARK-6682: - This is a very good idea. Please note, though, that there are a few issues here. 1) Setting the optimizer: the optimizers (LBFGS and SGD) take Gradient and Updater as constructor parameters. I don't think it is a good idea to force users to create a Gradient and an Updater separately just to be able to create an Optimizer. So one has to explicitly implement methods like setLBFGSOptimizer or setSGDOptimizer and have them return the optimizer so that the user can set its parameters.

```
def LBFGSOptimizer: LBFGS = {
  val lbfgs = new LBFGS(_gradient, _updater)
  optimizer = lbfgs
  lbfgs
}
```

Another downside is that if someone implements a new Optimizer, one has to add a corresponding setMyOptimizer to the builder. The above problems might be solved by figuring out a better interface for Optimizer that allows setting its parameters without actually creating it. 2) Setting parameters after setting the optimizer: what if the user sets the Updater after setting the Optimizer? The Optimizer takes the Updater as a constructor parameter! So one has to recreate the corresponding Optimizer.

```
private[this] def updateGradient(gradient: Gradient): Unit = {
  optimizer match {
    case lbfgs: LBFGS => lbfgs.setGradient(gradient)
    case sgd: GradientDescent => sgd.setGradient(gradient)
    case other => throw new UnsupportedOperationException(
      s"Only LBFGS and GradientDescent are supported but got ${other.getClass}.")
  }
}
```

So it is essential to work out the Optimizer interface first. Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K.
Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
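The "better interface for Optimizer" suggested in the comment above could, for example, expose Gradient and Updater through setters instead of constructor parameters, so a builder holds one Optimizer field and configures it uniformly, with no per-optimizer set methods and no pattern matching. Below is a hypothetical one-dimensional sketch; Gradient, Updater and SimpleGradientDescent are simplified stand-ins, not MLlib's actual classes:

```scala
// Simplified stand-ins for MLlib's Gradient and Updater (one-dimensional).
trait Gradient { def compute(weights: Double, point: Double): Double }
trait Updater { def update(weights: Double, gradient: Double, stepSize: Double): Double }

// An Optimizer that receives its Gradient and Updater via chainable setters,
// so it can be constructed first and configured later, in any order.
trait Optimizer {
  protected var gradient: Gradient = null
  protected var updater: Updater = null
  def setGradient(g: Gradient): this.type = { gradient = g; this }
  def setUpdater(u: Updater): this.type = { updater = u; this }
  def optimize(data: Seq[Double], initialWeights: Double): Double
}

class SimpleGradientDescent(stepSize: Double, numIterations: Int) extends Optimizer {
  def optimize(data: Seq[Double], initialWeights: Double): Double = {
    var w = initialWeights
    for (_ <- 1 to numIterations) {
      val g = data.map(p => gradient.compute(w, p)).sum / data.size // average gradient
      w = updater.update(w, g, stepSize)
    }
    w
  }
}
```

With this shape, a builder can call setGradient/setUpdater on whatever Optimizer it holds, in any order, and nothing needs to be recreated when the Updater changes after the Optimizer was set.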
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485554#comment-14485554 ] Alexander Ulanov commented on SPARK-6682: - [~yuu.ishik...@gmail.com] They reside in the package org.apache.spark.mllib.optimization: class LBFGS(private var gradient: Gradient, private var updater: Updater) and class GradientDescent private[mllib] (private var gradient: Gradient, private var updater: Updater). They extend the Optimizer trait, which has only one function: def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector. This function is limited to only one type of input: vectors and their labels. I have submitted a separate issue regarding this: https://issues.apache.org/jira/browse/SPARK-5362. 1. Right now the static methods work with hard-coded optimizers, such as LogisticRegressionWithSGD. This is not very convenient. I think moving away from static methods and using builders implies that optimizers could also be set by users. It will be a problem because the current optimizers require an Updater and a Gradient at creation time. 2. The workaround I suggested in the previous post addresses this. Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. 
Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
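The pros listed in the issue description can be made concrete with a toy pair of APIs. This is a hypothetical sketch with stub types, not MLlib's actual NaiveBayes:

```scala
// Toy model with a single regularization parameter.
class Model(val lambda: Double)

// Static-train style: without default arguments (needed for Java callers),
// every prefix of the parameter list requires its own overload.
object StaticStyle {
  def train(data: Seq[Double]): Model = train(data, 1.0)
  def train(data: Seq[Double], lambda: Double): Model = new Model(lambda)
}

// Builder style: the default lives in one place, setters are chainable,
// and users who just want to try the algorithm call train() directly.
class BuilderStyle {
  private var lambda: Double = 1.0
  def setLambda(value: Double): this.type = { lambda = value; this }
  def train(data: Seq[Double]): Model = new Model(lambda)
}
```

With many parameters, the static style needs an overload per parameter prefix, while the builder adds one setter per parameter, which is the "much less code" point above.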
[jira] [Created] (SPARK-8449) HDF5 read/write support for Spark MLlib
Alexander Ulanov created SPARK-8449: --- Summary: HDF5 read/write support for Spark MLlib Key: SPARK-8449 URL: https://issues.apache.org/jira/browse/SPARK-8449 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.1 Add support for reading and writing HDF5 file format to/from LabeledPoint. HDFS and local file system have to be supported. Other Spark formats to be discussed. Interface proposal: /* path - directory path in any Hadoop-supported file system URI */ MLUtils.saveAsHDF5(sc: SparkContext, path: String, RDD[LabeledPoint]): Unit /* path - file or directory path in any Hadoop-supported file system URI */ MLUtils.loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8449) HDF5 read/write support for Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592398#comment-14592398 ] Alexander Ulanov commented on SPARK-8449: - It seems that using the official HDF5 reader is not a viable choice for Spark due to platform-dependent binaries. We need to look for a pure Java implementation. Apparently, there is one called netCDF: http://www.unidata.ucar.edu/blogs/news/entry/netcdf_java_library_version_44. It might be tricky to use because the license is not Apache. However, it is worth a look. HDF5 read/write support for Spark MLlib --- Key: SPARK-8449 URL: https://issues.apache.org/jira/browse/SPARK-8449 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.1 Original Estimate: 96h Remaining Estimate: 96h Add support for reading and writing HDF5 file format to/from LabeledPoint. HDFS and local file system have to be supported. Other Spark formats to be discussed. Interface proposal: /* path - directory path in any Hadoop-supported file system URI */ MLUtils.saveAsHDF5(sc: SparkContext, path: String, RDD[LabeledPoint]): Unit /* path - file or directory path in any Hadoop-supported file system URI */ MLUtils.loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8449) HDF5 read/write support for Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592398#comment-14592398 ] Alexander Ulanov edited comment on SPARK-8449 at 6/18/15 7:53 PM: -- It seems that using the official HDF5 reader is not a viable choice for Spark due to platform-dependent binaries. We need to look for a pure Java implementation. Apparently, there is one called netCDF: http://www.unidata.ucar.edu/blogs/news/entry/netcdf_java_library_version_44. It might be tricky to use because the license is not Apache. However, it is worth a look. was (Author: avulanov): It seems that using the official HDF5 reader is not a viable choice for Spark due to platform dependent binaries. We need to look for pure Java implementation. Apparently, there is one called netCDF: http://www.unidata.ucar.edu/blogs/news/entry/netcdf_java_library_version_44. It might be tricky to use it because the license is not Apache. However it worth a look. HDF5 read/write support for Spark MLlib --- Key: SPARK-8449 URL: https://issues.apache.org/jira/browse/SPARK-8449 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.1 Original Estimate: 96h Remaining Estimate: 96h Add support for reading and writing HDF5 file format to/from LabeledPoint. HDFS and local file system have to be supported. Other Spark formats to be discussed. Interface proposal: /* path - directory path in any Hadoop-supported file system URI */ MLUtils.saveAsHDF5(sc: SparkContext, path: String, RDD[LabeledPoint]): Unit /* path - file or directory path in any Hadoop-supported file system URI */ MLUtils.loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14582768#comment-14582768 ] Alexander Ulanov commented on SPARK-5575: - Hi Janani, There is already an implementation of DBN (and RBM) by [~gq]. You can find it here: https://github.com/witgo/spark/tree/ann-interface-gemm-dbn Artificial neural networks for MLlib deep learning -- Key: SPARK-5575 URL: https://issues.apache.org/jira/browse/SPARK-5575 Project: Spark Issue Type: Umbrella Components: MLlib Affects Versions: 1.2.0 Reporter: Alexander Ulanov Goal: Implement various types of artificial neural networks Motivation: deep learning trend Requirements: 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so they can be easily extended or reused 2) Implement complex abstractions, such as feed forward and recurrent networks 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), autoencoder (sparse and denoising), stacked autoencoder, restricted Boltzmann machines (RBM), deep belief networks (DBN) etc. 4) Implement or reuse supporting constructs, such as classifiers, normalizers, poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9897) User Guide for Multilayer Perceptron Classifier
[ https://issues.apache.org/jira/browse/SPARK-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14694356#comment-14694356 ] Alexander Ulanov commented on SPARK-9897: - We already have an issue for MLP classifier docs: https://issues.apache.org/jira/browse/SPARK-9846. I plan to resolve it soon. Could you close this one? User Guide for Multilayer Perceptron Classifier --- Key: SPARK-9897 URL: https://issues.apache.org/jira/browse/SPARK-9897 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang SPARK-9471 adds MLPs to ML Pipelines, an algorithm not covered by the MLlib docs. We should update the user guide to include this under the {{Algorithm Guides Algorithms in spark.ml}} section of {{ml-guide}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-9897) User Guide for Multilayer Perceptron Classifier
[ https://issues.apache.org/jira/browse/SPARK-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-9897: Comment: was deleted (was: We already have an issue for MLP classifier docs: https://issues.apache.org/jira/browse/SPARK-9846. I plan to resolve it soon. Could you close this one?) User Guide for Multilayer Perceptron Classifier --- Key: SPARK-9897 URL: https://issues.apache.org/jira/browse/SPARK-9897 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang SPARK-9471 adds MLPs to ML Pipelines, an algorithm not covered by the MLlib docs. We should update the user guide to include this under the {{Algorithm Guides Algorithms in spark.ml}} section of {{ml-guide}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9897) User Guide for Multilayer Perceptron Classifier
[ https://issues.apache.org/jira/browse/SPARK-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14694355#comment-14694355 ] Alexander Ulanov commented on SPARK-9897: - We already have an issue for MLP classifier docs: https://issues.apache.org/jira/browse/SPARK-9846. I plan to resolve it soon. Could you close this one? User Guide for Multilayer Perceptron Classifier --- Key: SPARK-9897 URL: https://issues.apache.org/jira/browse/SPARK-9897 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang SPARK-9471 adds MLPs to ML Pipelines, an algorithm not covered by the MLlib docs. We should update the user guide to include this under the {{Algorithm Guides Algorithms in spark.ml}} section of {{ml-guide}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9951) Example code for Multilayer Perceptron Classifier
[ https://issues.apache.org/jira/browse/SPARK-9951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700567#comment-14700567 ] Alexander Ulanov commented on SPARK-9951: - I've submitted a PR for the user guide. Could you suggest whether the example code in the PR can be used for this issue? https://github.com/apache/spark/pull/8262 Example code for Multilayer Perceptron Classifier - Key: SPARK-9951 URL: https://issues.apache.org/jira/browse/SPARK-9951 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Joseph K. Bradley Add an example to the examples/ code folder for Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9380) Pregel example fix in graphx-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov resolved SPARK-9380. - Resolution: Fixed Pregel example fix in graphx-programming-guide -- Key: SPARK-9380 URL: https://issues.apache.org/jira/browse/SPARK-9380 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Pregel operator to express single source shortest path does not work due to incorrect type of the graph: Graph[Int, Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-9380) Pregel example fix in graphx-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-9380: Comment: was deleted (was: It seems that I did not name the PR correctly. I renamed it and resolved this issue. Sorry for the inconvenience. ) Pregel example fix in graphx-programming-guide -- Key: SPARK-9380 URL: https://issues.apache.org/jira/browse/SPARK-9380 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Pregel operator to express single source shortest path does not work due to incorrect type of the graph: Graph[Int, Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9380) Pregel example fix in graphx-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646854#comment-14646854 ] Alexander Ulanov commented on SPARK-9380: - It seems that I did not name the PR correctly. I renamed it and resolved this issue. Sorry for the inconvenience. Pregel example fix in graphx-programming-guide -- Key: SPARK-9380 URL: https://issues.apache.org/jira/browse/SPARK-9380 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Pregel operator to express single source shortest path does not work due to incorrect type of the graph: Graph[Int, Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9508) Align graphx programming guide with the updated Pregel code
Alexander Ulanov created SPARK-9508: --- Summary: Align graphx programming guide with the updated Pregel code Key: SPARK-9508 URL: https://issues.apache.org/jira/browse/SPARK-9508 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Alexander Ulanov Priority: Minor Fix For: 1.4.0 SPARK-9436 simplifies the Pregel code. graphx-programming-guide needs to be modified accordingly since it lists the old Pregel code -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9436) Simplify Pregel by merging joins
[ https://issues.apache.org/jira/browse/SPARK-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-9436: Summary: Simplify Pregel by merging joins (was: Merge joins in Pregel ) Simplify Pregel by merging joins Key: SPARK-9436 URL: https://issues.apache.org/jira/browse/SPARK-9436 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.4.0 Reporter: Alexander Ulanov Priority: Minor Fix For: 1.4.0 Original Estimate: 1h Remaining Estimate: 1h Pregel code contains two consecutive joins: ``` g.vertices.innerJoin(messages)(vprog) ... g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) } ``` They can be replaced by one join. Ankur Dave proposed a patch based on our discussion on the mailing list: https://www.mail-archive.com/dev@spark.apache.org/msg10316.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9436) Merge joins in Pregel
Alexander Ulanov created SPARK-9436: --- Summary: Merge joins in Pregel Key: SPARK-9436 URL: https://issues.apache.org/jira/browse/SPARK-9436 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.4.0 Reporter: Alexander Ulanov Priority: Minor Fix For: 1.4.0 Pregel code contains two consecutive joins: ``` g.vertices.innerJoin(messages)(vprog) ... g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) } ``` They can be replaced by one join. Ankur Dave proposed a patch based on our discussion on the mailing list: https://www.mail-archive.com/dev@spark.apache.org/msg10316.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
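For illustration, the two consecutive joins quoted in the issue can be collapsed into a single pass. The sketch below is not the actual patch from the mailing-list discussion; it only shows the idea, assuming `g`, `messages`, and `vprog` are the graph, the message VertexRDD, and the vertex program from GraphX's Pregel loop:

```scala
// Before: innerJoin runs vprog on vertices that received messages, then
// outerJoinVertices merges the results back into the graph.
// After (sketch): one outerJoinVertices that runs vprog only where a
// message arrived and keeps the old attribute otherwise.
g = g.outerJoinVertices(messages) { (vid, oldAttr, msgOpt) =>
  msgOpt.map(msg => vprog(vid, oldAttr, msg)).getOrElse(oldAttr)
}
```

The merged form is semantically equivalent because vertices without messages keep their previous attribute, which is exactly what the original outer join did.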
[jira] [Created] (SPARK-9471) Multilayer perceptron
Alexander Ulanov created SPARK-9471: --- Summary: Multilayer perceptron Key: SPARK-9471 URL: https://issues.apache.org/jira/browse/SPARK-9471 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Implement Multilayer Perceptron for Spark ML. Requirements: 1) ML pipelines interface 2) Extensible internal interface for further development of artificial neural networks for ML 3) Efficient and scalable: use vectors and BLAS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9951) Example code for Multilayer Perceptron Classifier
[ https://issues.apache.org/jira/browse/SPARK-9951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14697902#comment-14697902 ] Alexander Ulanov commented on SPARK-9951: - I have this already; I plan to use it for the User Guide. Should we have different example code in the examples folder? Example code for Multilayer Perceptron Classifier - Key: SPARK-9951 URL: https://issues.apache.org/jira/browse/SPARK-9951 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Joseph K. Bradley Add an example to the examples/ code folder for Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9380) Pregel example fix in graphx-programming-guide
Alexander Ulanov created SPARK-9380: --- Summary: Pregel example fix in graphx-programming-guide Key: SPARK-9380 URL: https://issues.apache.org/jira/browse/SPARK-9380 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Pregel operator to express single source shortest path does not work due to incorrect type of the graph: Graph[Int, Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
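The corrected guide example looks roughly like the following sketch. The key point is the type annotation: `GraphGenerators.logNormalGraph` produces Long vertex attributes, so the graph must be declared `Graph[Long, Double]`, not `Graph[Int, Double]`. This is an illustrative reconstruction (a SparkContext `sc` is assumed), not a verbatim copy of the guide:

```scala
import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators

// A graph with edge attributes containing distances.
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42L
// Initialize distances: 0.0 at the source, infinity everywhere else.
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist), // vertex program
  triplet => {                                    // send message
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b)                        // merge messages
)
```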
[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14642513#comment-14642513 ] Alexander Ulanov commented on SPARK-9273: - I had not heard about the PR until it was submitted. It would be useful to look at the code, benchmark it and see if it fits our API. Add Convolutional Neural network to Spark MLlib --- Key: SPARK-9273 URL: https://issues.apache.org/jira/browse/SPARK-9273 Project: Spark Issue Type: New Feature Components: MLlib Reporter: yuhao yang Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14642513#comment-14642513 ] Alexander Ulanov edited comment on SPARK-9273 at 7/27/15 9:54 AM: -- I had not heard about the PR until it was submitted. It would be useful to look at the code, benchmark it and see if it fits our API. I've added the link to the umbrella issue for deep learning: https://issues.apache.org/jira/browse/SPARK-5575 was (Author: avulanov): I had not heard about the PR until it was submitted. It would be useful to look at the code, benchmark it and see if it fits our API. Add Convolutional Neural network to Spark MLlib --- Key: SPARK-9273 URL: https://issues.apache.org/jira/browse/SPARK-9273 Project: Spark Issue Type: New Feature Components: MLlib Reporter: yuhao yang Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9118) Implement integer array parameters for ml.param as IntArrayParam
Alexander Ulanov created SPARK-9118: --- Summary: Implement integer array parameters for ml.param as IntArrayParam Key: SPARK-9118 URL: https://issues.apache.org/jira/browse/SPARK-9118 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Alexander Ulanov Priority: Minor Fix For: 1.4.0 ml/param/params.scala lacks an integer array parameter. It is needed for some models, such as the multilayer perceptron, to specify the layer sizes. I suggest implementing it as IntArrayParam, similarly to DoubleArrayParam. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
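A minimal sketch of the proposed class, modeled on the existing DoubleArrayParam in ml/param/params.scala. Treat this as an illustration of the proposal rather than the merged implementation:

```scala
import org.apache.spark.ml.param.{Param, ParamPair, Params}

// Sketch: an integer array parameter, analogous to DoubleArrayParam.
class IntArrayParam(parent: Params, name: String, doc: String,
                    isValid: Array[Int] => Boolean)
  extends Param[Array[Int]](parent, name, doc, isValid) {

  // Convenience constructor that accepts any value.
  def this(parent: Params, name: String, doc: String) =
    this(parent, name, doc, (_: Array[Int]) => true)

  // Creates a param pair with the given value.
  override def w(value: Array[Int]): ParamPair[Array[Int]] = super.w(value)
}
```

A model such as the multilayer perceptron could then declare `val layers = new IntArrayParam(this, "layers", "sizes of layers")`.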
[jira] [Commented] (SPARK-9120) Add multivariate regression (or prediction) interface
[ https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630654#comment-14630654 ] Alexander Ulanov commented on SPARK-9120: - Thank you, it sounds doable. Add multivariate regression (or prediction) interface - Key: SPARK-9120 URL: https://issues.apache.org/jira/browse/SPARK-9120 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Original Estimate: 1h Remaining Estimate: 1h org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable with a method predict:Double by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification. It has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9120) Add multivariate regression (or prediction) interface
[ https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630596#comment-14630596 ] Alexander Ulanov commented on SPARK-9120: - I think it should work for train (aka fit), which has to return the model; I am not sure about the model itself. The common ancestor Model does not contain anything that can be called for prediction, and its direct successor PredictionModel has predict:Double. Is there another way that you were referring to? Add multivariate regression (or prediction) interface - Key: SPARK-9120 URL: https://issues.apache.org/jira/browse/SPARK-9120 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Original Estimate: 1h Remaining Estimate: 1h org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable with a method predict:Double by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification. It has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630453#comment-14630453 ] Alexander Ulanov commented on SPARK-3702: - [~josephkb] Hi, Joseph! Do you plan to add support for multivariate regression? I need this for the multi-layer perceptron. A multivariate regression interface might be useful for other tasks as well. I've added an issue: https://issues.apache.org/jira/browse/SPARK-9120. Also, I wonder if you plan to add integer array parameters: https://issues.apache.org/jira/browse/SPARK-9118. Both seem to be relatively easy to implement; the question is whether you plan to merge these features in the near future. Standardize MLlib classes for learners, models -- Key: SPARK-3702 URL: https://issues.apache.org/jira/browse/SPARK-3702 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Critical Summary: Create a class hierarchy for learning algorithms and the models those algorithms produce. This is a super-task of several sub-tasks (but JIRA does not allow subtasks of subtasks). See the requires links below for subtasks. Goals: * give intuitive structure to API, both for developers and for generated documentation * support meta-algorithms (e.g., boosting) * support generic functionality (e.g., evaluation) * reduce code duplication across classes [Design doc for class hierarchy | https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9120) Add multivariate regression (or prediction) interface
[ https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630560#comment-14630560 ] Alexander Ulanov commented on SPARK-9120: - Thank you for sharing your thoughts. Do you mean that the algorithm that does multivariate regression should not be implemented within ML since ML does not support multivariate, so the algorithm should live within MLlib for a while until you figure out a generic interface? By support I mean handling the .fit and .transform stuff, etc. Add multivariate regression (or prediction) interface - Key: SPARK-9120 URL: https://issues.apache.org/jira/browse/SPARK-9120 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Original Estimate: 1h Remaining Estimate: 1h org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable with a method predict:Double by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification. It has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9120) Add multivariate regression (or prediction) interface
Alexander Ulanov created SPARK-9120: --- Summary: Add multivariate regression (or prediction) interface Key: SPARK-9120 URL: https://issues.apache.org/jira/browse/SPARK-9120 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable with a method predict:Double by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification. It has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9120) Add multivariate regression (or prediction) interface
[ https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-9120: Description: org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable with a method predict:Double by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification. It has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector. Update:After reading the design docs, adding predictMultivariate to RegressionModel does not seem reasonable to me anymore. The issue is as follows. RegressionModel extends PredictionModel which has predict:Double. Its train method uses predict:Double for prediction, i.e. PredictionModel is hard-coded to have only one output. It is the same problem that I pointed out long time ago in MLLib ( was: org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable with a method predict:Double by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification. It has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector. Add multivariate regression (or prediction) interface - Key: SPARK-9120 URL: https://issues.apache.org/jira/browse/SPARK-9120 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Original Estimate: 1h Remaining Estimate: 1h org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable with a method predict:Double by extending the Predictor. 
There is a need for multivariate prediction, at least for regression. I propose to modify RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification. It has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector. Update:After reading the design docs, adding predictMultivariate to RegressionModel does not seem reasonable to me anymore. The issue is as follows. RegressionModel extends PredictionModel which has predict:Double. Its train method uses predict:Double for prediction, i.e. PredictionModel is hard-coded to have only one output. It is the same problem that I pointed out long time ago in MLLib ( -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9120) Add multivariate regression (or prediction) interface
[ https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-9120: Description: org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable with a method predict:Double by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification. It has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector. Update: After reading the design docs, adding predictMultivariate to RegressionModel does not seem reasonable to me anymore. The issue is as follows. RegressionModel extends PredictionModel which has predict:Double. Its train method uses predict:Double for prediction, i.e. PredictionModel (and RegressionModel) is hard-coded to have only one output. There exists a similar problem in MLLib (https://issues.apache.org/jira/browse/SPARK-5362). The possible solution for this problem might require redesigning the class hierarchy or adding a separate interface that extends Model, though the latter means code duplication. was: org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable with a method predict:Double by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification. It has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector. Update: After reading the design docs, adding predictMultivariate to RegressionModel does not seem reasonable to me anymore. The issue is as follows. RegressionModel extends PredictionModel which has predict:Double. 
Its train method uses predict:Double for prediction, i.e. PredictionModel is hard-coded to have only one output. It is the same problem that I pointed out a long time ago in MLLib ( Add multivariate regression (or prediction) interface - Key: SPARK-9120 URL: https://issues.apache.org/jira/browse/SPARK-9120 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Original Estimate: 1h Remaining Estimate: 1h org.apache.spark.ml.regression.RegressionModel supports prediction only for a single variable with a method predict:Double by extending the Predictor. There is a need for multivariate prediction, at least for regression. I propose to modify RegressionModel interface similarly to how it is done in ClassificationModel, which supports multiclass classification. It has predict:Double and predictRaw:Vector. Analogously, RegressionModel should have something like predictMultivariate:Vector. Update: After reading the design docs, adding predictMultivariate to RegressionModel does not seem reasonable to me anymore. The issue is as follows. RegressionModel extends PredictionModel which has predict:Double. Its train method uses predict:Double for prediction, i.e. PredictionModel (and RegressionModel) is hard-coded to have only one output. There exists a similar problem in MLLib (https://issues.apache.org/jira/browse/SPARK-5362). The possible solution for this problem might require redesigning the class hierarchy or adding a separate interface that extends Model, though the latter means code duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
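To make the proposal concrete, the predictMultivariate idea from the description could be sketched as the following hypothetical trait. This is not Spark's API; the trait name and method names are illustrative only:

```scala
import org.apache.spark.mllib.linalg.Vector

// Hypothetical sketch (not Spark API): a regression model that predicts a
// vector of targets, by analogy with ClassificationModel's predictRaw:Vector.
trait MultivariateRegressionModel {
  /** Predict a single target, e.g. the first component of the output. */
  def predict(features: Vector): Double

  /** Predict all targets at once. */
  def predictMultivariate(features: Vector): Vector
}
```

The difficulty noted in the description remains: PredictionModel's transform path is wired to predict:Double, so such a trait either needs a redesigned hierarchy or duplicated transform logic.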
[jira] [Created] (SPARK-11262) Unit test for gradient, loss layers, memory management for multilayer perceptron
Alexander Ulanov created SPARK-11262: Summary: Unit test for gradient, loss layers, memory management for multilayer perceptron Key: SPARK-11262 URL: https://issues.apache.org/jira/browse/SPARK-11262 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.1 Reporter: Alexander Ulanov Fix For: 1.5.1 Multi-layer perceptron requires more rigorous tests and refactoring of layer interfaces to accommodate development of new features. 1) Implement unit tests for gradient and loss 2) Refactor the internal layer interface to extract the "loss function" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
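A standard way to implement the gradient unit test in 1) is a finite-difference check: compare the analytic gradient of the loss against a numerical approximation. The snippet below is a self-contained illustration of the pattern in plain Scala, not the actual Spark test code:

```scala
// Central finite difference: (f(x + eps) - f(x - eps)) / (2 * eps)
// approximates f'(x) with O(eps^2) error.
def numericalGradient(f: Double => Double, x: Double, eps: Double = 1e-6): Double =
  (f(x + eps) - f(x - eps)) / (2 * eps)

// Example: for the squared loss f(x) = x * x, the analytic gradient at
// x = 3 is 2 * 3 = 6; the numerical estimate should agree closely.
val f = (x: Double) => x * x
val approx = numericalGradient(f, 3.0)
assert(math.abs(approx - 6.0) < 1e-4)
```

In a layer test, `f` would be the loss as a function of one weight, perturbed while all other weights are held fixed, and the check would be repeated for each weight.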
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997300#comment-14997300 ] Alexander Ulanov commented on SPARK-5575: - Hi Narine, Thank you for your observation. It seems that such information is useful to know. Indeed, LBFGS in Spark does not print any information during the execution. ANN uses Spark's LBFGS. You might want to add the needed output to the LBFGS code https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L185. Best regards, Alexander > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constructs, such as classifiers, normalizers, > poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-10408: - Description: Goal: Implement various types of autoencoders Requirements: 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1]. real in [-inf, +inf] 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to the MLP and then used here 3)Denoising autoencoder 4)Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers References: 1, 2. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(3371–3408). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." Advances in neural information processing systems 19 (2007): 153. http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf was: Goal: Implement various types of autoencoders Requirements: 1)Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1]. real in [-inf, +inf] 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to the MLP and then used here 3)Denoising autoencoder 4)Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers: References: 1-3. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf 4. 
http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers > References: > 1, 2. > http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, > 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. > (2010). Stacked denoising autoencoders: Learning useful representations in a > deep network with a local denoising criterion. Journal of Machine Learning > Research, 11(3371–3408). > http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf > 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep > networks." Advances in neural information processing systems 19 (2007): 153. > http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-10408: - Description: Goal: Implement various types of autoencoders Requirements: 1) Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf] 2) Sparse autoencoder, i.e. L1 regularization. It should be added as a feature to the MLP and then used here 3) Denoising autoencoder 4) Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers References: 1. Vincent, Pascal, et al. "Extracting and composing robust features with denoising autoencoders." Proceedings of the 25th International Conference on Machine Learning. ACM, 2008. http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf 2. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." Advances in Neural Information Processing Systems 19 (2007): 153. http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf
was: Goal: Implement various types of autoencoders Requirements: 1) Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf] 2) Sparse autoencoder, i.e. L1 regularization. It should be added as a feature to the MLP and then used here 3) Denoising autoencoder 4) Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers References: 1, 2. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." Advances in Neural Information Processing Systems 19 (2007): 153. http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf
> Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML > Affects Versions: 1.5.0 > Reporter: Alexander Ulanov > Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988705#comment-14988705 ] Alexander Ulanov commented on SPARK-5575: - Hi Disha, RNN is a major feature. I suggest starting with a smaller contribution. Spark has included an implementation of the multilayer perceptron since version 1.5. New features are expected to reuse its code and follow the internal API it introduced. > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib > Affects Versions: 1.2.0 > Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so they can be easily extended or reused > 2) Implement complex abstractions, such as feed-forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), autoencoder (sparse and denoising), stacked autoencoder, restricted Boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constructs, such as classifiers, normalizers, poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992447#comment-14992447 ] Alexander Ulanov commented on SPARK-9273: - Hi Yuhao. Sounds good! Thanks for refactoring the code to support the ANN internal interface. Also, I was able to run your example. It shows increasing accuracy while training; however, it is not very fast. There is a good explanation of how to use matrix multiplication for convolution: http://cs231n.github.io/convolutional-networks/. Basically, one needs to roll all image patches (the regions that will be convolved) into vectors and stack them together in a matrix. The weights of the convolutional layer should also be rolled into vectors and stacked. Multiplying the two matrices yields the convolution result, which can be unrolled into a 3-D matrix, though that is not necessary for this implementation. We can discuss it offline if you wish. Besides the optimization, there are a few more things to be done: unit tests for the new layers, a gradient test, representing the pooling layer as a functional layer, and a performance comparison with other CNN implementations. You can take a look at the tests I've added for MLP (https://issues.apache.org/jira/browse/SPARK-11262) and the MLP benchmark at https://github.com/avulanov/ann-benchmark. A separate branch/repo for these developments might be a good thing to do. I'll be happy to help you with this. > Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib > Reporter: yuhao yang > > Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
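The patch-rolling ("im2col") idea described in the comment above can be sketched in a few lines of plain Python. This is purely an illustration of the technique, not the Spark or CNN code under discussion; with a stack of kernels the per-patch dot products below become a single matrix-matrix multiplication, which is what native BLAS accelerates.

```python
# Illustrative sketch of im2col-based convolution: roll every patch of the
# image into a row, roll the kernel into a vector, and take dot products.

def im2col(image, k):
    """Roll every k x k patch of a 2-D image into a row of a matrix."""
    h, w = len(image), len(image[0])
    return [[image[i + di][j + dj] for di in range(k) for dj in range(k)]
            for i in range(h - k + 1) for j in range(w - k + 1)]

def conv_via_matmul(image, kernel):
    """Convolution (cross-correlation): each output element is the dot
    product of a rolled patch with the rolled kernel."""
    k = len(kernel)
    flat_kernel = [kernel[di][dj] for di in range(k) for dj in range(k)]
    return [sum(p * w for p, w in zip(row, flat_kernel))
            for row in im2col(image, k)]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
# Top-left patch [1, 2, 4, 5] . [1, 0, 0, 1] = 6, and so on.
print(conv_via_matmul(image, kernel))  # [6, 8, 12, 14]
```

The flat output can be left as a vector inside the layer's output matrix, matching the comment's point that unrolling back to a 3-D tensor is unnecessary.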
[jira] [Comment Edited] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992447#comment-14992447 ] Alexander Ulanov edited comment on SPARK-9273 at 11/5/15 8:50 PM: -- Hi Yuhao. Sounds good! Thanks for refactoring the code to support the ANN internal interface. Also, I was able to run your example. It shows increasing accuracy while training; however, it is not very fast. Does it work with LBFGS? There is a good explanation of how to use matrix multiplication for convolution: http://cs231n.github.io/convolutional-networks/. Basically, one needs to roll all image patches (the regions that will be convolved) into vectors and stack them together in a matrix. The weights of the convolutional layer should also be rolled into vectors and stacked. Multiplying the two matrices yields the convolution result, which can be unrolled into a 3-D matrix, though that is not necessary for this implementation. We can discuss it offline if you wish. Besides the optimization, there are a few more things to be done: unit tests for the new layers, a gradient test, representing the pooling layer as a functional layer, and a performance comparison with other CNN implementations. You can take a look at the tests I've added for MLP (https://issues.apache.org/jira/browse/SPARK-11262) and the MLP benchmark at https://github.com/avulanov/ann-benchmark. A separate branch/repo for these developments might be a good thing to do. I'll be happy to help you with this. > Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib > Reporter: yuhao yang > > Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
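The "gradient test" mentioned in the comments above is usually a finite-difference check: the analytic gradient of a layer is compared against a central-difference approximation of the loss. A minimal sketch of the idea, using a toy loss rather than Spark's actual gradient test suite:

```python
# Finite-difference gradient check: the standard way to unit-test an
# analytic gradient. Illustrative only; not Spark's test code.

def numeric_gradient(f, w, eps=1e-6):
    """Central-difference approximation of the gradient of f at w."""
    grads = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grads.append((f(w_plus) - f(w_minus)) / (2 * eps))
    return grads

def loss(w):  # toy loss with known analytic gradient (2*w0, 3)
    return w[0] ** 2 + 3 * w[1]

w = [1.5, -2.0]
analytic = [2 * w[0], 3.0]
numeric = numeric_gradient(loss, w)
# The test passes when the two gradients agree to within tolerance.
assert all(abs(a - n) < 1e-4 for a, n in zip(analytic, numeric))
```

In a real layer test, `loss` would be the network's loss as a function of that layer's unrolled weights.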
[jira] [Commented] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726423#comment-14726423 ] Alexander Ulanov commented on SPARK-10408: -- Added an implementation of (1), the basic deep autoencoder: https://github.com/avulanov/spark/tree/autoencoder-mlp > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML > Affects Versions: 1.5.0 > Reporter: Alexander Ulanov > Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1) Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf] > 2) Sparse autoencoder, i.e. L1 regularization. It should be added as a feature to the MLP and then used here > 3) Denoising autoencoder > 4) Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10409) Multilayer perceptron regression
Alexander Ulanov created SPARK-10409: Summary: Multilayer perceptron regression Key: SPARK-10409 URL: https://issues.apache.org/jira/browse/SPARK-10409 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.0 Reporter: Alexander Ulanov Priority: Minor Implement regression based on the multilayer perceptron (MLP). It should support different kinds of outputs: binary, real in [0;1) and real in [-inf; +inf]. The implementation might take advantage of the autoencoder. Time-series forecasting for financial data might be one of the use cases; see http://dl.acm.org/citation.cfm?id=561452. More specific requirements from this (or another) application area are therefore needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
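The three output kinds in the proposal map naturally onto different output-layer activations, since the activation bounds the range of the regressor's output. A minimal, framework-free illustration (generic functions, not Spark's MLP internals):

```python
import math

# Candidate output-layer activations for an MLP regressor; the choice
# determines which target range the network can produce.

def sigmoid(z):
    """Output in (0, 1) -- suits targets in [0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def identity(z):
    """Unbounded output -- suits targets in [-inf, +inf]."""
    return z

def binary_step(z):
    """Hard 0/1 output -- suits binary targets."""
    return 1.0 if z >= 0.0 else 0.0

z = 0.5  # a raw pre-activation value from the last layer
print(sigmoid(z), identity(z), binary_step(z))
```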
[jira] [Commented] (SPARK-10409) Multilayer perceptron regression
[ https://issues.apache.org/jira/browse/SPARK-10409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726435#comment-14726435 ] Alexander Ulanov commented on SPARK-10409: -- Basic implementation with the current ML api can be found here: https://github.com/avulanov/spark/blob/a2261330c227be8ef26172dbe355a617d653553a/mllib/src/main/scala/org/apache/spark/ml/regression/MultilayerPerceptronRegressor.scala > Multilayer perceptron regression > > > Key: SPARK-10409 > URL: https://issues.apache.org/jira/browse/SPARK-10409 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Implement regression based on multilayer perceptron (MLP). It should support > different kinds of outputs: binary, real in [0;1) and real in [-inf; +inf]. > The implementation might take advantage of autoencoder. Time-series > forecasting for financial data might be one of the use cases, see > http://dl.acm.org/citation.cfm?id=561452. So there is the need for more > specific requirements from this (or other) area. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-10408: - Description: Goal: Implement various types of autoencoders Requirements: 1) Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf] 2) Sparse autoencoder, i.e. L1 regularization. It should be added as a feature to the MLP and then used here 3) Denoising autoencoder 4) Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers References: 1-3. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf
was: Goal: Implement various types of autoencoders Requirements: 1) Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf] 2) Sparse autoencoder, i.e. L1 regularization. It should be added as a feature to the MLP and then used here 3) Denoising autoencoder 4) Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers
> Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML > Affects Versions: 1.5.0 > Reporter: Alexander Ulanov > Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726423#comment-14726423 ] Alexander Ulanov edited comment on SPARK-10408 at 9/1/15 11:55 PM: --- Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp (https://github.com/avulanov/spark/blob/autoencoder-mlp/mllib/src/main/scala/org/apache/spark/ml/feature/Autoencoder.scala) was (Author: avulanov): Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp ( > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers: > References: > 1-3. > http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf > 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726423#comment-14726423 ] Alexander Ulanov edited comment on SPARK-10408 at 9/1/15 11:55 PM: --- Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp ( was (Author: avulanov): Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp (https://github.com/avulanov/spark/blob/ann-auto-rbm-mlor/mllib/src/main/scala/org/apache/spark/mllib/ann/Autoencoder.scala) > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers: > References: > 1-3. > http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf > 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726423#comment-14726423 ] Alexander Ulanov edited comment on SPARK-10408 at 9/1/15 11:54 PM: --- Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp (https://github.com/avulanov/spark/blob/ann-auto-rbm-mlor/mllib/src/main/scala/org/apache/spark/mllib/ann/Autoencoder.scala) was (Author: avulanov): Added implementation for (1) that is basic deep autoencoder https://github.com/avulanov/spark/tree/autoencoder-mlp > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers: > References: > 1-3. > http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf > 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-10408: - Issue Type: Umbrella (was: Improvement) > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10408) Autoencoder
Alexander Ulanov created SPARK-10408: Summary: Autoencoder Key: SPARK-10408 URL: https://issues.apache.org/jira/browse/SPARK-10408 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.0 Reporter: Alexander Ulanov Priority: Minor Goal: Implement various types of autoencoders Requirements: 1) Basic (deep) autoencoder that supports different types of inputs: binary, real in [0..1], real in [-inf, +inf] 2) Sparse autoencoder, i.e. L1 regularization. It should be added as a feature to the MLP and then used here 3) Denoising autoencoder 4) Stacked autoencoder for pre-training of deep networks. It should support arbitrary network layers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
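Requirement (1) describes a plain deep autoencoder: an MLP trained to reproduce its own input through a narrower hidden layer. A toy, framework-free sketch of the forward pass and the reconstruction error the training would minimize (illustrative weights, not the proposed Spark implementation):

```python
import math

def affine(x, weights, bias):
    """One dense layer; weights is a list of rows, one per output unit."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def sigmoid_vec(v):
    return [1.0 / (1.0 + math.exp(-z)) for z in v]

def autoencode(x, enc_w, enc_b, dec_w, dec_b):
    """Encode the input to a smaller hidden code, then decode it back."""
    code = sigmoid_vec(affine(x, enc_w, enc_b))
    return sigmoid_vec(affine(code, dec_w, dec_b))  # reconstruction in (0, 1)

def reconstruction_error(x, x_hat):
    """Squared reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

# A 4-D binary input compressed to a 2-D code and decoded back.
x = [1.0, 0.0, 1.0, 0.0]
enc_w = [[0.5, -0.5, 0.5, -0.5], [-0.5, 0.5, -0.5, 0.5]]
enc_b = [0.0, 0.0]
dec_w = [[1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, 1.0]]
dec_b = [0.0, 0.0, 0.0, 0.0]
x_hat = autoencode(x, enc_w, enc_b, dec_w, dec_b)
print(reconstruction_error(x, x_hat))
```

A sparse autoencoder (requirement 2) would add an L1 penalty on the code to this objective, and a denoising autoencoder (requirement 3) would corrupt `x` before encoding while still measuring the error against the clean input.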
[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-10324: - Description: Following SPARK-8445, we created this master list for MLlib features we plan to have in Spark 1.6. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items in the comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delays in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on a feature. This is to avoid duplicate work. For small features, you don't need to wait to get the JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for a certain amount of time, the JIRA should be released to other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Remember to add the `@Since("1.6.0")` annotation to new public APIs. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. 
For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. * If you start reviewing a PR, please add yourself to the Shepherd field on JIRA. * If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping a maintainer to make a final pass. * After merging a PR, create and link JIRAs for Python, example code, and documentation if necessary. h1. Roadmap (WIP) This is NOT [a complete list of MLlib JIRAs for 1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include umbrella JIRAs and high-level tasks. h2. Algorithms and performance * log-linear model for survival analysis (SPARK-8518) * normal equation approach for linear regression (SPARK-9834) * iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835) * robust linear regression with Huber loss (SPARK-3181) * vector-free L-BFGS (SPARK-10078) * tree partition by features (SPARK-3717) * bisecting k-means (SPARK-6517) * weighted instance support (SPARK-9610) ** logistic regression (SPARK-7685) ** linear regression (SPARK-9642) ** random forest (SPARK-9478) * locality sensitive hashing (LSH) (SPARK-5992) * deep learning (SPARK-5575) ** autoencoder (SPARK-10408) ** restricted Boltzmann machine (RBM) (SPARK-4251) ** convolutional neural network (stretch) * factorization machine (SPARK-7008) * local linear algebra (SPARK-6442) * distributed LU decomposition (SPARK-8514) h2. Statistics * univariate statistics as UDAFs (SPARK-10384) * bivariate statistics as UDAFs (SPARK-10385) * R-like statistics for GLMs (SPARK-9835) * online hypothesis testing (SPARK-3147) h2. Pipeline API * pipeline persistence (SPARK-6725) * ML attribute API improvements (SPARK-8515) * feature transformers (SPARK-9930) ** feature interaction (SPARK-9698) ** SQL transformer (SPARK-8345) ** ?? 
* predict single instance (SPARK-10413) * test Kaggle datasets (SPARK-9941) h2. Model persistence * PMML export ** naive Bayes (SPARK-8546) ** decision tree (SPARK-8542) * model save/load ** FPGrowth (SPARK-6724) ** PrefixSpan (SPARK-10386) * code generation ** decision tree and tree ensembles (SPARK-10387) h2. Data sources * LIBSVM data source (SPARK-10117) * public dataset loader (SPARK-10388) h2. Python API for ML The main goal of Python API is to have feature parity with Scala/Java API. You can find a complete list [here|https://issues.apache.org/jira/issues/?filter=12333214]. The tasks fall into two major categories: * Python API for new algorithms * Python API for missing methods h2. SparkR API for ML * support more families and link functions in SparkR::glm (SPARK-9838, SPARK-9839, SPARK-9840) * better R formula support (SPARK-9681) * model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837) h2. Documentation * re-organize user guide (SPARK-8517) * @Since versions in spark.ml, pyspark.mllib, and
[jira] [Closed] (SPARK-4752) Classifier based on artificial neural network
[ https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov closed SPARK-4752. --- Resolution: Fixed Fix Version/s: 1.5.0 > Classifier based on artificial neural network > - > > Key: SPARK-4752 > URL: https://issues.apache.org/jira/browse/SPARK-4752 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Alexander Ulanov > Fix For: 1.5.0 > > Original Estimate: 168h > Remaining Estimate: 168h > > Implement classifier based on artificial neural network (ANN). Requirements: > 1) Use the existing artificial neural network implementation > https://issues.apache.org/jira/browse/SPARK-2352, > https://github.com/apache/spark/pull/1290 > 2) Extend MLlib ClassificationModel trait, > 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training, > 4) Be able to return the ANN model -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10627) Regularization for artificial neural networks
[ https://issues.apache.org/jira/browse/SPARK-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746636#comment-14746636 ] Alexander Ulanov commented on SPARK-10627: -- Dropout WIP refactoring for the new ML API https://github.com/avulanov/spark/tree/dropout-mlp. > Regularization for artificial neural networks > - > > Key: SPARK-10627 > URL: https://issues.apache.org/jira/browse/SPARK-10627 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Add regularization for artificial neural networks. Includes, but not limited > to: > 1)L1 and L2 regularization > 2)Dropout http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf > 3)Dropconnect > http://machinelearning.wustl.edu/mlpapers/paper_files/icml2013_wan13.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10627) Regularization for artificial neural networks
Alexander Ulanov created SPARK-10627: Summary: Regularization for artificial neural networks Key: SPARK-10627 URL: https://issues.apache.org/jira/browse/SPARK-10627 Project: Spark Issue Type: Umbrella Components: ML Affects Versions: 1.5.0 Reporter: Alexander Ulanov Priority: Minor Add regularization for artificial neural networks. Includes, but not limited to: 1)L1 and L2 regularization 2)Dropout http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf 3)Dropconnect http://machinelearning.wustl.edu/mlpapers/paper_files/icml2013_wan13.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
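For context on item (2), dropout randomly zeroes hidden activations during training and rescales the survivors so the expected activation is unchanged (the "inverted dropout" formulation from the Srivastava et al. paper linked above). An illustrative toy sketch, not the Spark code in question:

```python
import random

def dropout(activations, p, rng, training=True):
    """Inverted dropout: during training, zero each unit with probability p
    and scale survivors by 1/(1-p); at inference, pass activations through
    unchanged so no test-time rescaling is needed."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(42)
acts = [0.2, 0.9, 0.5, 0.7]
print(dropout(acts, p=0.5, rng=rng))                  # some units zeroed, survivors doubled
print(dropout(acts, p=0.5, rng=rng, training=False))  # unchanged at inference
```

L1/L2 regularization (item 1) instead adds a weight penalty to the loss, and DropConnect (item 3) applies the same masking idea to individual weights rather than whole activations.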
[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737948#comment-14737948 ] Alexander Ulanov commented on SPARK-9273: - Hi Yuhao! I have a few comments regarding the interface and the optimization of your implementation. There are two options for optimizing convolutions: using matrix-matrix multiplication and using FFTs. The latter seems a bit more complicated, since we don't have an optimized parallel FFT in Spark, and it also has to support batch data processing. If one instead uses matrix-matrix multiplication for convolution, it can take advantage of native BLAS, and batch computations can be supported straightforwardly. Another benefit is that we would not need to change the current Layer's input/output type (matrix) to a tensor. We can store the unrolled inputs/outputs as vectors within the input/output matrix. Do you think that it is reasonable? > Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib > Reporter: yuhao yang > > Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737948#comment-14737948 ] Alexander Ulanov edited comment on SPARK-9273 at 9/10/15 1:18 AM: -- Hi Yuhao! I have a few comments regarding the interface and the optimization of your implementation. There are two options for optimizing convolutions: using matrix-matrix multiplication and using FFTs. The latter seems a bit more complicated, since we don't have an optimized parallel FFT in Spark, and it also has to support batch data processing. If one instead uses matrix-matrix multiplication for convolution, it can take advantage of native BLAS, and batch computations can be supported straightforwardly. Another benefit is that we would not need to change the current Layer's input/output type (matrix) to a tensor. We can store the unrolled inputs/outputs as vectors within the input/output matrix. Does it make sense to you?
> Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib > Reporter: yuhao yang > > Add Convolutional Neural network to Spark MLlib
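The "convolution as matrix-matrix multiplication" approach discussed in the comment above is commonly implemented with the im2col transformation: every input patch is unrolled into a row of a matrix, so one convolution becomes a single gemm that native BLAS could accelerate. The following is a minimal pure-Python sketch of that idea, with illustrative names only; it is not Spark MLlib code.

```python
# Sketch of convolution via im2col + matrix multiplication.
# All function names here are illustrative, not the Spark MLlib API.

def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2-D image into a row of a matrix."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            patch = [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            rows.append(patch)
    return rows  # (out_h * out_w) x (kh * kw)

def matmul(a, b):
    """Naive matrix product; in Spark this step would be a native BLAS gemm."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def conv2d(image, kernel):
    """Valid (no-padding) 2-D convolution expressed as one matrix product."""
    kh, kw = len(kernel), len(kernel[0])
    patches = im2col(image, kh, kw)
    flat_kernel = [[k] for row in kernel for k in row]  # (kh*kw) x 1 column
    out = matmul(patches, flat_kernel)
    out_w = len(image[0]) - kw + 1
    out_h = len(image) - kh + 1
    return [[out[r * out_w + c][0] for c in range(out_w)] for r in range(out_h)]
```

Because each unrolled patch is a plain vector, a whole batch of images can be stacked into one tall matrix and convolved with a single multiplication, which is the batching benefit the comment refers to.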
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14940446#comment-14940446 ] Alexander Ulanov commented on SPARK-5575: - Hi Weide, Sounds good! What kind of feature are you planning to add? > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib > Affects Versions: 1.2.0 > Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constructs, such as classifiers, normalizers, > poolers, etc.
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944173#comment-14944173 ] Alexander Ulanov commented on SPARK-5575: - Weide, These are major features, and some of them are under development. You can check their status in the linked issues. Could you work on something smaller as a first step? [~mengxr], do you have any suggestions? > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib > Affects Versions: 1.2.0 > Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constructs, such as classifiers, normalizers, > poolers, etc.
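Requirement 1 in the umbrella issue quoted above asks for Layer, Forward, and Backpropagation as extensible traits or interfaces. A minimal sketch of that abstraction pattern, in Python for illustration only (the names are hypothetical and do not match the actual Spark MLlib `Layer` trait):

```python
# Hypothetical sketch of the Layer/forward/backpropagation abstraction from
# requirement 1; concrete layers extend the interface, and a feed-forward
# network is just a composition of layers.

class Layer:
    def forward(self, x):
        """Compute the layer output for input x."""
        raise NotImplementedError
    def backward(self, grad_out):
        """Propagate the output gradient back to an input gradient."""
        raise NotImplementedError

class Scale(Layer):
    """Toy layer y = a * x, so dL/dx = a * dL/dy."""
    def __init__(self, a):
        self.a = a
    def forward(self, x):
        return [self.a * v for v in x]
    def backward(self, grad_out):
        return [self.a * g for g in grad_out]

class FeedForward(Layer):
    """Requirement 2's feed-forward network: layers applied in sequence."""
    def __init__(self, layers):
        self.layers = layers
    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x
    def backward(self, grad_out):
        for layer in reversed(self.layers):
            grad_out = layer.backward(grad_out)
        return grad_out
```

Because `FeedForward` itself implements `Layer`, networks compose: an MLP, an autoencoder, or a convolutional stack can all reuse the same interface, which is the extensibility the requirement is after.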
[jira] [Created] (SPARK-15893) spark.createDataFrame raises an exception in Spark 2.0 tests on Windows
Alexander Ulanov created SPARK-15893: Summary: spark.createDataFrame raises an exception in Spark 2.0 tests on Windows Key: SPARK-15893 URL: https://issues.apache.org/jira/browse/SPARK-15893 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.0.0 Reporter: Alexander Ulanov spark.createDataFrame raises an exception in Spark 2.0 tests on Windows. For example, LogisticRegressionSuite fails at line 46: Exception encountered when invoking run on a nested suite - java.net.URISyntaxException: Relative path in absolute URI: file:C:/dev/spark/external/flume-assembly/spark-warehouse java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/dev/spark/external/flume-assembly/spark-warehouse at org.apache.hadoop.fs.Path.initialize(Path.java:206) at org.apache.hadoop.fs.Path.<init>(Path.java:172) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:109) Another example, DataFrameSuite raises: java.net.URISyntaxException: Relative path in absolute URI: file:C:/dev/spark/external/flume-assembly/spark-warehouse java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/dev/spark/external/flume-assembly/spark-warehouse at org.apache.hadoop.fs.Path.initialize(Path.java:206) at org.apache.hadoop.fs.Path.<init>(Path.java:172)
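The "Relative path in absolute URI" error above comes from the Windows drive-letter path being appended directly to the `file:` scheme with no `//` authority part: in `file:C:/...` the part after the colon does not start with `/`, so URI parsing treats it as a relative path, while a well-formed Windows file URI looks like `file:///C:/...`. A small stdlib sketch (not Spark or Hadoop code) showing the difference:

```python
# Why "file:C:/dev/spark/..." is rejected as a relative path in an absolute
# URI, demonstrated with Python's stdlib URI parser. Without the "//"
# authority separator, the path component is "C:/..." (relative); with it,
# the path is "/C:/..." (absolute).
from urllib.parse import urlparse

bad = urlparse("file:C:/dev/spark/spark-warehouse")
good = urlparse("file:///C:/dev/spark/spark-warehouse")

print(bad.path)    # C:/dev/spark/spark-warehouse  (does not start with "/")
print(good.path)   # /C:/dev/spark/spark-warehouse (absolute URI path)
```

Hadoop's `Path` constructor performs the same kind of check when qualifying the warehouse directory, which is why the suites fail before any test logic runs.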
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325377#comment-15325377 ] Alexander Ulanov commented on SPARK-15581: -- I would like to comment on the Breeze and deep learning parts, because I have been implementing the multilayer perceptron for Spark and have used Breeze a lot. Breeze provides convenient abstractions for dense and sparse vectors and matrices and allows performing linear algebra backed by netlib-java and native BLAS. At the same time, Spark "linalg" has its own abstractions for that. This might be confusing to users and developers. Obviously, Spark should have a single library for linear algebra. Having said that, Breeze is more convenient and flexible than linalg, though it misses some features such as in-place matrix multiplications and multidimensional arrays. Breeze cannot be removed from Spark because "linalg" does not have enough functionality to fully replace it. To address this, I have implemented a Scala tensor library on top of netlib-java. "linalg" can be wrapped around it. It also provides functions similar to Breeze and allows working with multi-dimensional arrays. [~mengxr], [~dbtsai] and myself were planning to discuss this after the 2.0 release, and I am posting these considerations here since you raised this question too. Could you take a look at this library and tell me what you think? The source code is here: https://github.com/avulanov/scala-tensor With regards to deep learning, I believe that having deep learning within Spark's ML library is a question of convenience. Spark has broad analytic capabilities, and it is useful to have deep learning as one of these tools at hand. Deep learning is a model of choice for several important modern use-cases, and Spark ML might want to cover them. Eventually, it is hard to explain why we have PCA in ML but do not provide an Autoencoder. 
To summarize, I think that Spark should have at least the most widely used deep learning models, such as the fully connected artificial neural network, convolutional network and autoencoder. Advanced and experimental deep learning features might reside within packages or as pluggable external tools. Spark ML already has fully connected networks in place. Stacked autoencoder is implemented but not merged yet. The only thing that remains is the convolutional network. These three will provide a comprehensive deep learning set for Spark ML. > MLlib 2.1 Roadmap > - > > Key: SPARK-15581 > URL: https://issues.apache.org/jira/browse/SPARK-15581 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib > Reporter: Joseph K. Bradley > Priority: Blocker > Labels: roadmap > > This is a master list for MLlib improvements we are working on for the next > release. Please view this as a wish list rather than a definite plan, for we > don't have an accurate estimate of available resources. Due to limited review > bandwidth, features appearing on this list will get higher priority during > code review. But feel free to suggest new items to the list in comments. We > are experimenting with this process. Your feedback would be greatly > appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. Based on our experience, mixing the development > process with a big feature usually causes a long delay in code review. > * Never work silently. Let everyone know on the corresponding JIRA page when > you start working on some features. This is to avoid duplicate work. For > small features, you don't need to wait to get the JIRA assigned. 
> * For medium/big features or features with dependencies, please get assigned > first before coding and keep the ETA updated on the JIRA. If there is no > activity on the JIRA page for a certain amount of time, the JIRA should be > released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours. > h2. For committers: > * Try to break down big features into small and specific JIRA tasks and link > them properly. > * Add a "starter" label to starter tasks. > * Put a rough estimate for medium/big features and track the progress. > * If you start reviewing a PR, please add yourself to the Shepherd field on > JIRA. > * If
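The multi-dimensional arrays that the comment above says Breeze lacks are commonly built as one flat, contiguous buffer plus per-axis strides, which is also what allows tensor data to be handed to native BLAS or viewed as vectors inside a larger matrix. The following is a hypothetical Python sketch of that flat-storage-plus-strides design; it is an illustration of the general technique, not the actual API of the scala-tensor project linked above.

```python
# Sketch of an n-d tensor as a flat buffer with row-major strides.
# Names and layout choices here are illustrative assumptions.

def prod(xs):
    """Product of a sequence of dimensions."""
    r = 1
    for x in xs:
        r *= x
    return r

class Tensor:
    def __init__(self, data, shape):
        assert len(data) == prod(shape), "buffer size must match the shape"
        self.data, self.shape = data, shape
        # Row-major strides: the last axis varies fastest.
        self.strides = []
        acc = 1
        for dim in reversed(shape):
            self.strides.insert(0, acc)
            acc *= dim

    def __getitem__(self, idx):
        """Map an n-d index to a flat offset via the strides."""
        flat = sum(i * s for i, s in zip(idx, self.strides))
        return self.data[flat]
```

Because the storage stays one contiguous vector regardless of the number of axes, the same buffer can back a matrix view for BLAS gemm calls, which is the kind of bridge between "linalg" matrices and multi-dimensional data the comment describes.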
[jira] [Created] (SPARK-15851) Spark 2.0 does not compile in Windows 7
Alexander Ulanov created SPARK-15851: Summary: Spark 2.0 does not compile in Windows 7 Key: SPARK-15851 URL: https://issues.apache.org/jira/browse/SPARK-15851 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.0.0 Environment: Windows 7 Reporter: Alexander Ulanov Spark does not compile in Windows 7. "mvn compile" fails on spark-core due to trying to execute a bash script, spark-build-info. Workaround: 1) install win-bash and put it in the path; 2) change line 350 of core/pom.xml. Error trace: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project spark-core_2.11: An Ant BuildException has occured: Execute failed: java.io.IOException: Cannot run program "C:\dev\spark\core\..\build\spark-build-info" (in directory "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 application [ERROR] around Ant part .. @ 4:73 in C:\dev\spark\core\target\antrun\build-main.xml