[GitHub] spark pull request: [SPARK-6164] [ML] CrossValidatorModel should k...
Github user leahmcguire commented on the pull request: https://github.com/apache/spark/pull/5915#issuecomment-108594833 Anything I should try to do to fix this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6164] [ML] CrossValidatorModel should k...
Github user leahmcguire commented on the pull request: https://github.com/apache/spark/pull/5915#issuecomment-108632111 Yay!
[GitHub] spark pull request: [SPARK-6164] [ML] CrossValidatorModel should k...
Github user leahmcguire commented on the pull request: https://github.com/apache/spark/pull/5915#issuecomment-108092203 Ok, think I fixed the merge and cleaned up the pull request so it is just my files.
[GitHub] spark pull request: [SPARK-6164] [ML] CrossValidatorModel should k...
Github user leahmcguire commented on the pull request: https://github.com/apache/spark/pull/5915#issuecomment-108078500 I will try to resolve and update the pull request. On Tue, Jun 2, 2015 at 10:49 AM, jkbradley notificati...@github.com wrote: Uh oh, those tests won't work because of merge conflicts. Reply to this email directly or view it on GitHub https://github.com/apache/spark/pull/5915#issuecomment-108029093.
[GitHub] spark pull request: [SPARK-7545][mllib] Added check in Bernoulli N...
GitHub user leahmcguire opened a pull request: https://github.com/apache/spark/pull/6073 [SPARK-7545][mllib] Added check in Bernoulli Naive Bayes to make sure that both training and predict feature have values of 0 or 1 You can merge this pull request into a Git repository by running: $ git pull https://github.com/leahmcguire/spark binaryCheckNB Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/6073.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #6073 commit 04f0d3c6732ce503de95c0b3e8bcf87f16767877 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-05-05T17:40:09Z Added stats from cross validation as a val in the cross validation model to save them for user access commit 58d060b518133b1e64ef86ca7aee61b76d6c6990 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-05-08T01:16:27Z changed param name and test according to comments commit f191c71afcfe1b9a0d989669c152fad58d4bab89 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-05-08T01:20:55Z fixed name commit 67253f08cdf97a32c7caf2c6e65fee495e218aad Author: leahmcguire lmcgu...@salesforce.com Date: 2015-05-12T03:52:53Z added check to bernoulli to ensure feature values are zero or one commit f44bb3c39c0d73e7d8a67a6e79f6bd741cdb0425 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-05-12T04:07:00Z removed changes from CV branch commit 831fd279e16a97711b30346c19a1dcde16728f19 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-05-12T05:28:51Z got test working
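For readers following the thread, a minimal sketch of the kind of check this pull request describes, written in plain Scala without any Spark dependency. The object and method names (`BinaryFeatureCheck`, `isBinary`, `requireBinary`) are hypothetical and are not taken from the patch; the patch itself may implement the check differently.

```scala
// Hypothetical helper, not the code from the pull request.
object BinaryFeatureCheck {
  /** Returns true when every feature value is exactly 0.0 or 1.0. */
  def isBinary(values: Array[Double]): Boolean =
    values.forall(v => v == 0.0 || v == 1.0)

  /** Throws IllegalArgumentException if any feature value is not 0 or 1. */
  def requireBinary(values: Array[Double]): Unit =
    require(
      isBinary(values),
      s"Bernoulli naive Bayes requires 0 or 1 feature values but found ${values.mkString("[", ", ", "]")}.")
}
```

The same check would presumably run on both the training vectors and the vectors passed to predict, since the pull request title mentions both.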
[GitHub] spark pull request: [SPARK-6164] [ML] CrossValidatorModel should k...
GitHub user leahmcguire reopened a pull request: https://github.com/apache/spark/pull/5915 [SPARK-6164] [ML] CrossValidatorModel should keep stats from fitting Added stats from cross validation as a val in the cross validation model to save them for user access. You can merge this pull request into a Git repository by running: $ git pull https://github.com/leahmcguire/spark saveCVmetrics Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5915.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5915 commit e0020099abbfd6b968abb5a778518c9cbdac9d59 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-05-05T17:40:09Z Added stats from cross validation as a val in the cross validation model to save them for user access commit 47728db7cfac995d9417cdf0e16d07391aabd581 Author: Sandy Ryza sa...@cloudera.com Date: 2015-05-05T19:34:02Z [SPARK-5888] [MLLIB] Add OneHotEncoder as a Transformer This patch adds a one hot encoder for categorical features. Planning to add documentation and another test after getting feedback on the approach. A couple choices made here: * There's an `includeFirst` option which, if false, creates numCategories - 1 columns and, if true, creates numCategories columns. The default is true, which is the behavior in scikit-learn. * The user is expected to pass a `Seq` of category names when instantiating a `OneHotEncoder`. These can be easily gotten from a `StringIndexer`. The names are used for the output column names, which take the form colName_categoryName. Author: Sandy Ryza sa...@cloudera.com Closes #5500 from sryza/sandy-spark-5888 and squashes the following commits: f383250 [Sandy Ryza] Infer label names automatically 6e257b9 [Sandy Ryza] Review comments 7c539cf [Sandy Ryza] Vector transformers 1c182dd [Sandy Ryza] SPARK-5888. [MLLIB]. 
Add OneHotEncoder as a Transformer commit 489700c809a7c0a836538f3d0bd58bed609e8768 Author: zsxwing zsxw...@gmail.com Date: 2015-05-05T19:52:16Z [SPARK-6939] [STREAMING] [WEBUI] Add timeline and histogram graphs for streaming statistics This is the initial work of SPARK-6939. Not yet ready for code review. Here are the screenshots: ![graph1](https://cloud.githubusercontent.com/assets/1000778/7165766/465942e0-e3dc-11e4-9b05-c184b09d75dc.png) ![graph2](https://cloud.githubusercontent.com/assets/1000778/7165779/53f13f34-e3dc-11e4-8714-a4a75b7e09ff.png) TODOs: - [x] Display more information on mouse hover - [x] Align the timeline and distribution graphs - [x] Clean up the codes Author: zsxwing zsxw...@gmail.com Closes #5533 from zsxwing/SPARK-6939 and squashes the following commits: 9f7cd19 [zsxwing] Merge branch 'master' into SPARK-6939 deacc3f [zsxwing] Remove unused import cd03424 [zsxwing] Fix .rat-excludes 70cc87d [zsxwing] Streaming Scheduling Delay = Scheduling Delay d457277 [zsxwing] Fix UIUtils in BatchPage b3f303e [zsxwing] Add comments for unclear classes and methods ff0bff8 [zsxwing] Make InputDStream.name private[streaming] cc392c5 [zsxwing] Merge branch 'master' into SPARK-6939 e275e23 [zsxwing] Move time related methods to Streaming's UIUtils d5d86f6 [zsxwing] Fix incorrect lastErrorTime 3be4b7a [zsxwing] Use InputInfo b50fa32 [zsxwing] Jump to the batch page when clicking a point in the timeline graphs 203605d [zsxwing] Merge branch 'master' into SPARK-6939 74307cf [zsxwing] Reuse the data for histogram graphs to reduce the page size 2586916 [zsxwing] Merge branch 'master' into SPARK-6939 70d8533 [zsxwing] Remove BatchInfo.numRecords and a few renames 7bbdc0a [zsxwing] Hide the receiver sub table if no receiver a2972e9 [zsxwing] Add some ui tests for StreamingPage fd03ad0 [zsxwing] Add a test to verify no memory leak 4a8f886 [zsxwing] Merge branch 'master' into SPARK-6939 18607a1 [zsxwing] Merge branch 'master' into SPARK-6939 d0b0aec [zsxwing] Clean up 
the codes a459f49 [zsxwing] Add a dash line to processing time graphs 8e4363c [zsxwing] Prepare for the demo c81a1ee [zsxwing] Change time unit in the graphs automatically 4c0b43f [zsxwing] Update Streaming UI 04c7500 [zsxwing] Make the server and client use the same timezone fed8219 [zsxwing] Move the x axis at the top and show a better tooltip c23ce10 [zsxwing] Make two graphs close d78672a [zsxwing] Make the X axis use the same range 881c907 [zsxwing] Use histogram for distribution 5688702 [zsxwing] Fix the unit test ddf741a [zsxwing] Fix the unit test ad93295 [zsxwing] Remove unnecessary codes a0458f9 [zsxwing] Clean the codes b82ed1e [zsxwing] Update the graphs as per comments dd653a1
[GitHub] spark pull request: [SPARK-6164] [ML] CrossValidatorModel should k...
Github user leahmcguire closed the pull request at: https://github.com/apache/spark/pull/5915
[GitHub] spark pull request: [SPARK-6164] [ML] CrossValidatorModel should k...
Github user leahmcguire commented on the pull request: https://github.com/apache/spark/pull/5915#issuecomment-100051023 Fixed
[GitHub] spark pull request: [SPARK-6164 [ML] CrossValidatorModel should ke...
GitHub user leahmcguire opened a pull request: https://github.com/apache/spark/pull/5911 [SPARK-6164 [ML] CrossValidatorModel should keep stats from fitting Added stats from cross validation as a val in the cross validation model to save them for user access. You can merge this pull request into a Git repository by running: $ git pull https://github.com/leahmcguire/spark saveCVmetrics Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5911.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5911 commit ce73c63e8bac40b02ae0a8147c3b424783f6094a Author: leahmcguire lmcgu...@salesforce.com Date: 2015-01-16T16:06:06Z added Bernoulli option to niave bayes model in mllib, added optional model type parameter for training. When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html commit 4a3676d8d7e8c30778f95e9f479d97b4b1651ce4 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-01-21T00:19:14Z Updated changes re-comments. Got rid of verbose populateMatrix method. Public api now has string instead of enumeration. Docs are updated. 
commit 0313c0cbf8d41b9bcfb0536df253f6af0f1398f7 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-01-21T17:43:00Z fixed style error in NaiveBayes.scala commit 76e5b0f90e370e2cda20e1348bf40ff890f51782 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-01-26T18:29:47Z removed unnecessary sort from test commit d9477ed8450594de9f2da24af8f82c82def5ce24 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-02-26T17:16:12Z removed old inaccurate comment from test suite for mllib naive bayes commit 3891bf2f708bda712028551334960d2cc66af536 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-02-27T05:34:01Z synced with apache spark and resolved merge conflict commit 5a4a534d3636100546b5fa86d2d7ec2ed2051582 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-02-27T16:56:24Z fixed scala style error in NaiveBayes commit b61b5e2d91582689642fb045849df62a16ce111c Author: leahmcguire lmcgu...@salesforce.com Date: 2015-03-02T18:50:18Z added back compatable constructor to NaiveBayesModel to fix MIMA test failure commit 37305729334922c40804752598a30a2fb892c317 Author: Joseph K. 
Bradley jos...@databricks.com Date: 2015-03-03T23:22:20Z modified NB model type to be more Java-friendly commit b93aaf682572890c49a58da149612c0053afc3de Author: Leah McGuire lmcgu...@salesforce.com Date: 2015-03-05T19:03:33Z Merge pull request #1 from jkbradley/nb-model-type modified NB model type to be more Java-friendly commit 7622b0c002c12efd8fb2c6fa34a691c82c86edd8 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-03-05T19:07:25Z added comments and fixed style as per rb commit dc65374b4c7933700ffa4e3f572ec44ece382a05 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-03-05T19:24:50Z integrated model type fix commit 85f298f251f757772294ea68988522a5c26a19ac Author: leahmcguire lmcgu...@salesforce.com Date: 2015-03-05T19:25:34Z Merge remote-tracking branch 'upstream/master' commit e01656978174f8ecbd75ef6a50211234a1babfc6 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-03-05T19:28:05Z updated test suite with model type fix commit ea09b28c908e86f8ebc7bbb3e98bfe83cc636b78 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-03-05T19:48:41Z Merge remote-tracking branch 'upstream/master' commit 900b5864c16cc0db93a46ec3a4591a787e5a21a0 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-03-05T19:53:46Z fixed model call so that uses type argument commit b85b0c9e602770702a477cc36c7d72e2410c5139 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-03-05T21:04:02Z Merge remote-tracking branch 'upstream/master' commit c298e78ba7d58bb4d7e9b54d56ce51fe6b6b10a9 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-03-05T21:16:08Z fixed scala style errors commit 2d0c1ba631841a0c55212fbc8dd7327285972ef8 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-03-05T21:42:42Z fixed typo in NaiveBayes commit e2d925eb088f7cabb38024ecb7b0628557d261ba Author: leahmcguire lmcgu...@salesforce.com Date: 2015-03-07T01:26:17Z fixed nonserializable error that was causing naivebayes test failures commit fb0a5c70ce935cb8d9495152c809e06c8f350443 Author: leahmcguire 
lmcgu...@salesforce.com Date: 2015-03-09T20:36:36Z removed typo commit 01baad70f44fa12ad37a743d5d0fba861d89f149 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-03-11T22:44:22Z made fixes from code review commit bea62af37fdf389474474d80fdac3c94f6a8808f Author: leahmcguire lmcgu...@salesforce.com Date: 2015-03-12T18:10:16Z put back in constructor for NaiveBayes commit
[GitHub] spark pull request: [SPARK-6164 [ML] CrossValidatorModel should ke...
Github user leahmcguire closed the pull request at: https://github.com/apache/spark/pull/5911
[GitHub] spark pull request: [SPARK-6164] [ML] CrossValidatorModel should k...
GitHub user leahmcguire opened a pull request: https://github.com/apache/spark/pull/5915 [SPARK-6164] [ML] CrossValidatorModel should keep stats from fitting Added stats from cross validation as a val in the cross validation model to save them for user access. You can merge this pull request into a Git repository by running: $ git pull https://github.com/leahmcguire/spark saveCVmetrics Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5915.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5915 commit e0020099abbfd6b968abb5a778518c9cbdac9d59 Author: leahmcguire lmcgu...@salesforce.com Date: 2015-05-05T17:40:09Z Added stats from cross validation as a val in the cross validation model to save them for user access
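To illustrate the idea of keeping cross-validation stats alongside the chosen model, here is a rough sketch in plain Scala with no Spark dependency. Everything in it is an assumption made for illustration: the names `CvResult`, `crossValidate` and `avgMetrics`, the `evaluate` callback, and the convention that a larger metric is better. It is not the code from the pull request.

```scala
// Hypothetical sketch: k-fold search that returns the per-candidate averages
// together with the winning parameter instead of discarding them.
object CrossValidationSketch {
  /** The winning parameter plus the average metric for every candidate. */
  final case class CvResult[P](bestParam: P, avgMetrics: Seq[Double])

  /**
   * Evaluates each candidate on numFolds folds and keeps the averages.
   * evaluate(param, train, validation) returns a metric where larger is better.
   */
  def crossValidate[P, A](
      data: Seq[A],
      params: Seq[P],
      numFolds: Int)(evaluate: (P, Seq[A], Seq[A]) => Double): CvResult[P] = {
    require(numFolds >= 2 && params.nonEmpty, "need at least 2 folds and 1 candidate")
    val indexed = data.zipWithIndex
    val avgMetrics = params.map { p =>
      val foldMetrics = (0 until numFolds).map { fold =>
        // Element i belongs to the validation split of fold (i % numFolds).
        val (validation, train) = indexed.partition { case (_, i) => i % numFolds == fold }
        evaluate(p, train.map(_._1), validation.map(_._1))
      }
      foldMetrics.sum / numFolds
    }
    // Index of the largest average; ties keep the earliest candidate.
    val bestIndex = avgMetrics.indices.reduceLeft((a, b) => if (avgMetrics(b) > avgMetrics(a)) b else a)
    CvResult(params(bestIndex), avgMetrics)
  }
}
```

A caller can then inspect `avgMetrics` to see how close the runner-up candidates were, which is the kind of user access the pull request description asks for.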
[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...
Github user leahmcguire commented on the pull request: https://github.com/apache/spark/pull/4087#issuecomment-86292839 Either version is fine. If you have time to make the change tomorrow, go ahead and send the PR. Otherwise I'll have time to make the change on Friday. On Wed, Mar 25, 2015 at 12:41 PM, jkbradley notificati...@github.com wrote: (I was about to merge this, but then this issue came up.) After that adjustment, it should be fine. (And feel free to make this change yourself, but I'm offering to do it since the dev list discussion keeps going back and forth.) Reply to this email directly or view it on GitHub https://github.com/apache/spark/pull/4087#issuecomment-86187804.
[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...
Github user leahmcguire commented on a diff in the pull request: https://github.com/apache/spark/pull/4087#discussion_r26542594
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -156,9 +181,14 @@ object NaiveBayesModel extends Loader[NaiveBayesModel] {
  * document classification. By making every vector a 0-1 vector, it can also be used as
  * Bernoulli NB ([[http://tinyurl.com/p7c96j6]]). The input feature values must be nonnegative.
  */
-class NaiveBayes private (private var lambda: Double) extends Serializable with Logging {
-  def this() = this(1.0)
+class NaiveBayes private (
+    private var lambda: Double,
+    private var modelType: NaiveBayes.ModelType) extends Serializable with Logging {
+
+  def this(lambda: Double) = this(lambda, NaiveBayes.Multinomial)
--- End diff --
Nope, I tried adding it back as private before just adding it back and it still failed.
[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...
Github user leahmcguire commented on a diff in the pull request: https://github.com/apache/spark/pull/4087#discussion_r26543828
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -35,26 +39,30 @@ import org.apache.spark.sql.{DataFrame, SQLContext}
  * @param pi log of class priors, whose dimension is C, number of labels
  * @param theta log of class conditional probabilities, whose dimension is C-by-D,
  *              where D is number of features
+ * @param modelType The type of NB model to fit from the enumeration NaiveBayesModels, can be
+ *              Multinomial or Bernoulli
  */
 class NaiveBayesModel private[mllib] (
     val labels: Array[Double],
     val pi: Array[Double],
-    val theta: Array[Array[Double]]) extends ClassificationModel with Serializable with Saveable {
+    val theta: Array[Array[Double]],
+    val modelType: String)
--- End diff --
Yep that fixes it :P
[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...
Github user leahmcguire commented on a diff in the pull request: https://github.com/apache/spark/pull/4087#discussion_r26347821
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -156,9 +181,14 @@ object NaiveBayesModel extends Loader[NaiveBayesModel] {
  * document classification. By making every vector a 0-1 vector, it can also be used as
  * Bernoulli NB ([[http://tinyurl.com/p7c96j6]]). The input feature values must be nonnegative.
  */
-class NaiveBayes private (private var lambda: Double) extends Serializable with Logging {
-  def this() = this(1.0)
+class NaiveBayes private (
+    private var lambda: Double,
+    private var modelType: NaiveBayes.ModelType) extends Serializable with Logging {
+
+  def this(lambda: Double) = this(lambda, NaiveBayes.Multinomial)
--- End diff --
Removing this causes MiMa test failures.
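As a rough illustration of the pattern in the diff above (a primary constructor that gains a parameter, with an auxiliary constructor preserving the older single-argument signature), here is a standalone sketch. `CompatSketch`, `Trainer` and `setModelType` are hypothetical names, and the sketch makes no claim about which signatures MiMa actually compared in this pull request.

```scala
// Hypothetical sketch of keeping an older constructor signature available
// after the primary constructor has grown a new parameter.
object CompatSketch {
  sealed trait ModelType extends Serializable
  case object Multinomial extends ModelType
  case object Bernoulli extends ModelType

  // The primary constructor takes the new parameter and stays private.
  class Trainer private (val lambda: Double, val modelType: ModelType) extends Serializable {
    // Auxiliary constructor with the earlier one-argument shape; it
    // delegates to the primary constructor with the default model type.
    def this(lambda: Double) = this(lambda, Multinomial)

    // Returns a copy configured with a different model type.
    def setModelType(t: ModelType): Trainer = new Trainer(lambda, t)
  }
}
```

Existing callers that wrote `new Trainer(0.5)` keep compiling and linking, while new callers opt in to the Bernoulli variant through `setModelType`.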
[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...
Github user leahmcguire commented on a diff in the pull request: https://github.com/apache/spark/pull/4087#discussion_r26256579
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -35,26 +39,30 @@ import org.apache.spark.sql.{DataFrame, SQLContext}
  * @param pi log of class priors, whose dimension is C, number of labels
  * @param theta log of class conditional probabilities, whose dimension is C-by-D,
  *              where D is number of features
+ * @param modelType The type of NB model to fit from the enumeration NaiveBayesModels, can be
+ *              Multinomial or Bernoulli
  */
 class NaiveBayesModel private[mllib] (
     val labels: Array[Double],
     val pi: Array[Double],
-    val theta: Array[Array[Double]]) extends ClassificationModel with Serializable with Saveable {
+    val theta: Array[Array[Double]],
+    val modelType: String)
--- End diff --
I had to change this from the enum-like type to the string to fix the unit test failures. An actual enum worked, but the substitute that you suggested was throwing a non-serializable error on all of the NaiveBayes tests.
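The thread does not say what made the enum-like substitute non-serializable, so the following is only a sketch of the string-based alternative the comment describes: the model type is stored as a plain `String`, validated against a fixed set, and survives a Java serialization round trip. `ModelTypeSketch`, `Model`, `supported` and `roundTrip` are hypothetical names, not Spark APIs.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical sketch: a string-typed model type that serializes trivially.
object ModelTypeSketch {
  val Multinomial = "multinomial"
  val Bernoulli = "bernoulli"
  val supported: Set[String] = Set(Multinomial, Bernoulli)

  /** A stand-in model that carries its type as a validated string. */
  final case class Model(pi: Array[Double], modelType: String) {
    require(supported.contains(modelType), s"Unknown model type: $modelType")
  }

  /** Serializes a value with Java serialization and reads it back. */
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

The trade-off is that a string gives up compile-time checking of the allowed values, which is why the sketch validates it in the constructor.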
[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...
Github user leahmcguire commented on a diff in the pull request: https://github.com/apache/spark/pull/4087#discussion_r26258688
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -262,4 +303,58 @@ object NaiveBayes {
   def train(input: RDD[LabeledPoint], lambda: Double): NaiveBayesModel = {
     new NaiveBayes(lambda).run(input)
   }
+
+
+  /**
+   * Trains a Naive Bayes model given an RDD of `(label, features)` pairs.
+   *
+   * The model type can be set to either Multinomial NB ([[http://tinyurl.com/lsdw6p]])
+   * or Bernoulli NB ([[http://tinyurl.com/p7c96j6]]). The Multinomial NB can handle
+   * discrete count data and can be called by setting the model type to multinomial.
+   * For example, it can be used with word counts or TF_IDF vectors of documents.
+   * The Bernoulli model fits presence or absence (0-1) counts. By making every vector a
+   * 0-1 vector and setting the model type to bernoulli, the fits and predicts as
+   * Bernoulli NB.
+   *
+   * @param input RDD of `(label, array of features)` pairs. Every vector should be a frequency
+   *              vector or a count vector.
+   * @param lambda The smoothing parameter
+   *
+   * @param modelType The type of NB model to fit from the enumeration NaiveBayesModels, can be
+   *              multinomial or bernoulli
+   */
+  def train(input: RDD[LabeledPoint], lambda: Double, modelType: String): NaiveBayesModel = {
--- End diff --
If we remove this static train method, should we also remove the static train method that takes only lambda (line 326)? Otherwise the train calls are inconsistent for setting different model parameters (lambda and modelType).
[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...
Github user leahmcguire commented on the pull request: https://github.com/apache/spark/pull/4087#issuecomment-78381145 @jkbradley thanks for the comments! I have implemented everything except the two inline comments that I replied to directly. I'm not clear about how you want the versioning implemented on the save/load so it may be simpler for you to just push a PR to me.
[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...
Github user leahmcguire commented on the pull request: https://github.com/apache/spark/pull/4087#issuecomment-77435497 I made all the inline fixes and integrated the model type fix. If you can provide me with a bit more guidance on the save/load I am happy to do it.
[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...
Github user leahmcguire commented on a diff in the pull request: https://github.com/apache/spark/pull/4087#discussion_r25443491
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/classification/NaiveBayesSuite.scala ---
@@ -71,23 +86,67 @@ class NaiveBayesSuite extends FunSuite with MLlibTestSparkContext {
     assert(numOfPredictions < input.length / 5)
   }
-  test("Naive Bayes") {
+  def validateModelFit(piData: Array[Double], thetaData: Array[Array[Double]], model: NaiveBayesModel) = {
+    def closeFit(d1: Double, d2: Double, precision: Double): Boolean = {
+      (d1 - d2).abs <= precision
+    }
+    val modelIndex = (0 until piData.length).zip(model.labels.map(_.toInt))
+    for (i <- modelIndex) {
+      assert(closeFit(math.exp(piData(i._2)), math.exp(model.pi(i._1)), 0.05))
+    }
+    for (i <- modelIndex) {
+      for (j <- 0 until thetaData(i._2).length) {
+        assert(closeFit(math.exp(thetaData(i._2)(j)), math.exp(model.theta(i._1)(j)), 0.05))
+      }
+    }
+  }
+
+  test("Naive Bayes Multinomial") {
+    val nPoints = 1000
+
+    val pi = Array(0.5, 0.1, 0.4).map(math.log)
+    val theta = Array(
+      Array(0.70, 0.10, 0.10, 0.10), // label 0
+      Array(0.10, 0.70, 0.10, 0.10), // label 1
+      Array(0.10, 0.10, 0.70, 0.10) // label 2
+    ).map(_.map(math.log))
+
+    val testData = NaiveBayesSuite.generateNaiveBayesInput(pi, theta, nPoints, 42, NaiveBayesModels.Multinomial)
+    val testRDD = sc.parallelize(testData, 2)
+    testRDD.cache()
+
+    val model = NaiveBayes.train(testRDD, 1.0, "Multinomial")
+    validateModelFit(pi, theta, model)
+
+    val validationData = NaiveBayesSuite.generateNaiveBayesInput(pi, theta, nPoints, 17, NaiveBayesModels.Multinomial)
+    val validationRDD = sc.parallelize(validationData, 2)
+
+    // Test prediction on RDD.
+    validatePrediction(model.predict(validationRDD.map(_.features)).collect(), validationData)
+
+    // Test prediction on Array.
+    validatePrediction(validationData.map(row => model.predict(row.features)), validationData)
+  }
+
+  test("Naive Bayes Bernoulli") {
     val nPoints = 1
     val pi = Array(0.5, 0.3, 0.2).map(math.log)
     val theta = Array(
-      Array(0.91, 0.03, 0.03, 0.03), // label 0
-      Array(0.03, 0.91, 0.03, 0.03), // label 1
-      Array(0.03, 0.03, 0.91, 0.03) // label 2
+      Array(0.50, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.40), // label 0
+      Array(0.02, 0.70, 0.10, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02), // label 1
+      Array(0.02, 0.02, 0.60, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.30) // label 2
     ).map(_.map(math.log))
-    val testData = NaiveBayesSuite.generateNaiveBayesInput(pi, theta, nPoints, 42)
+
+    val testData = NaiveBayesSuite.generateNaiveBayesInput(pi, theta, nPoints, 45, NaiveBayesModels.Bernoulli)
     val testRDD = sc.parallelize(testData, 2)
     testRDD.cache()
-    val model = NaiveBayes.train(testRDD)
+    val model = NaiveBayes.train(testRDD, 1.0, "Bernoulli") ///!!! this gives same result on both models check the math
--- End diff --
No, this was resolved before the commit. I just forgot to remove the comment.
[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...
Github user leahmcguire commented on the pull request: https://github.com/apache/spark/pull/4087#issuecomment-70597399 Thanks for the comments! The JIRA for the python API is: https://issues.apache.org/jira/browse/SPARK-5328 I will get the rest fixed tonight or tomorrow.
[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...
GitHub user leahmcguire opened a pull request: https://github.com/apache/spark/pull/4087 [SPARK-4894][mllib] Added Bernoulli option to NaiveBayes model in mllib Added optional model type parameter for NaiveBayes training. Can be either Multinomial or Bernoulli. When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction as per: http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html. Default for model is original Multinomial fit and predict. Added additional testing for Bernoulli and Multinomial models. You can merge this pull request into a Git repository by running: $ git pull https://github.com/leahmcguire/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4087.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4087 commit ce73c63e8bac40b02ae0a8147c3b424783f6094a Author: leahmcguire lmcgu...@salesforce.com Date: 2015-01-16T16:06:06Z added Bernoulli option to niave bayes model in mllib, added optional model type parameter for training. When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html
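For context, a small self-contained sketch of Bernoulli naive Bayes fitting and prediction in plain Scala, following the textbook formulation linked above: the conditional estimate is (count of documents in the class containing the feature + lambda) / (documents in the class + 2 * lambda), and prediction adds log(1 - theta) for absent features. The names `BernoulliNbSketch`, `fit` and `predict`, the use of integer labels, and the lambda-smoothed prior are assumptions of this sketch and are not taken from the pull request.

```scala
// Hypothetical sketch of Bernoulli naive Bayes on 0/1 feature vectors.
object BernoulliNbSketch {
  /** labels(c) is the label of class c; pi and theta are stored as logs. */
  final case class Model(labels: Array[Int], pi: Array[Double], theta: Array[Array[Double]])

  /** Fits log priors and log conditional probabilities with additive smoothing. */
  def fit(data: Seq[(Int, Array[Double])], lambda: Double): Model = {
    require(data.nonEmpty && lambda > 0.0, "need data and a positive smoothing value")
    val numFeatures = data.head._2.length
    val byLabel = data.groupBy(_._1).toSeq.sortBy(_._1)
    val numDocs = data.size.toDouble
    val numLabels = byLabel.size
    val labels = byLabel.map(_._1).toArray
    // Smoothed class prior (an assumption of this sketch).
    val pi = byLabel.map { case (_, rows) =>
      math.log((rows.size + lambda) / (numDocs + numLabels * lambda))
    }.toArray
    // theta(c)(j) = log P(feature j present | class c).
    val theta = byLabel.map { case (_, rows) =>
      Array.tabulate(numFeatures) { j =>
        val present = rows.map(_._2(j)).sum
        math.log((present + lambda) / (rows.size + 2.0 * lambda))
      }
    }.toArray
    Model(labels, pi, theta)
  }

  /** Returns the label with the highest log posterior for a 0/1 vector. */
  def predict(model: Model, x: Array[Double]): Int = {
    val scores = model.labels.indices.map { c =>
      model.pi(c) + x.indices.map { j =>
        val logP = model.theta(c)(j)
        // Absent features contribute log(1 - P(present | class)).
        if (x(j) == 1.0) logP else math.log1p(-math.exp(logP))
      }.sum
    }
    val best = scores.indices.reduceLeft((a, b) => if (scores(b) > scores(a)) b else a)
    model.labels(best)
  }
}
```

The contribution from absent features is what distinguishes this from the multinomial model, where a zero count simply adds nothing to the score.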