[GitHub] spark pull request: [SPARK-6164] [ML] CrossValidatorModel should k...

2015-06-03 Thread leahmcguire
Github user leahmcguire commented on the pull request:

https://github.com/apache/spark/pull/5915#issuecomment-108594833
  
Anything I should try to do to fix this? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-6164] [ML] CrossValidatorModel should k...

2015-06-03 Thread leahmcguire
Github user leahmcguire commented on the pull request:

https://github.com/apache/spark/pull/5915#issuecomment-108632111
  
Yay!





[GitHub] spark pull request: [SPARK-6164] [ML] CrossValidatorModel should k...

2015-06-02 Thread leahmcguire
Github user leahmcguire commented on the pull request:

https://github.com/apache/spark/pull/5915#issuecomment-108092203
  
OK, I think I fixed the merge and cleaned up the pull request so it contains
just my files.





[GitHub] spark pull request: [SPARK-6164] [ML] CrossValidatorModel should k...

2015-06-02 Thread leahmcguire
Github user leahmcguire commented on the pull request:

https://github.com/apache/spark/pull/5915#issuecomment-108078500
  
I will try to resolve the conflicts and update the pull request.

On Tue, Jun 2, 2015 at 10:49 AM, jkbradley notificati...@github.com wrote:

 Uh oh, those tests won't work because of merge conflicts.

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/5915#issuecomment-108029093.







[GitHub] spark pull request: [SPARK-7545][mllib] Added check in Bernoulli N...

2015-05-11 Thread leahmcguire
GitHub user leahmcguire opened a pull request:

https://github.com/apache/spark/pull/6073

[SPARK-7545][mllib] Added check in Bernoulli Naive Bayes to make sure that 
both training and prediction features have values of 0 or 1
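
The kind of check described in the title could look roughly like the sketch below. This is an illustration only, not the code from the pull request: the object name `BernoulliCheckSketch` and the method `requireZeroOneFeatures` are hypothetical, and a plain `Array[Double]` stands in for an MLlib vector.

```scala
object BernoulliCheckSketch {
  /** Throws IllegalArgumentException unless every feature value is exactly 0.0 or 1.0. */
  def requireZeroOneFeatures(features: Array[Double]): Unit = {
    // Collect offending values so the error message can report them.
    val invalid = features.filter(v => v != 0.0 && v != 1.0)
    val shown = invalid.mkString(", ")
    require(invalid.isEmpty, "Bernoulli naive Bayes requires 0 or 1 feature values but found: " + shown)
  }
}
```

Under this sketch the same helper would be called on each training point before fitting and on each feature vector before prediction, so both paths reject non-binary input in the same way.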



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/leahmcguire/spark binaryCheckNB

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/6073.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6073


commit 04f0d3c6732ce503de95c0b3e8bcf87f16767877
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-05-05T17:40:09Z

Added stats from cross validation as a val in the cross validation model to 
save them for user access

commit 58d060b518133b1e64ef86ca7aee61b76d6c6990
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-05-08T01:16:27Z

changed param name and test according to comments

commit f191c71afcfe1b9a0d989669c152fad58d4bab89
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-05-08T01:20:55Z

fixed name

commit 67253f08cdf97a32c7caf2c6e65fee495e218aad
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-05-12T03:52:53Z

added check to bernoulli to ensure feature values are zero or one

commit f44bb3c39c0d73e7d8a67a6e79f6bd741cdb0425
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-05-12T04:07:00Z

removed changes from CV branch

commit 831fd279e16a97711b30346c19a1dcde16728f19
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-05-12T05:28:51Z

got test working







[GitHub] spark pull request: [SPARK-6164] [ML] CrossValidatorModel should k...

2015-05-07 Thread leahmcguire
GitHub user leahmcguire reopened a pull request:

https://github.com/apache/spark/pull/5915

[SPARK-6164] [ML] CrossValidatorModel should keep stats from fitting

Added stats from cross validation as a val in the cross validation model to 
save them for user access.
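
As a rough illustration of the idea (not the code in this pull request), the sketch below runs a toy k-fold cross validation in plain Scala and returns a model object that keeps the per-parameter average metrics as a `val`. The names `CrossValidationSketch`, `CvModelSketch`, `avgMetrics`, and `evaluate` are assumptions made for the example, and the "model" is just a shrunken mean scored by mean squared error.

```scala
object CrossValidationSketch {
  // Hypothetical stand-in for a fitted model that remembers its validation metrics.
  final case class CvModelSketch(bestParam: Double, avgMetrics: Array[Double])

  // Fits a constant predictor shrunk by `param`; returns its mean squared error on `test`.
  private def evaluate(train: Seq[Double], test: Seq[Double], param: Double): Double = {
    val prediction = train.sum / (train.size + param)
    test.map(y => math.pow(y - prediction, 2)).sum / test.size
  }

  def fit(data: Seq[Double], params: Seq[Double], numFolds: Int): CvModelSketch = {
    require(numFolds >= 2 && numFolds <= data.size, "need 2 <= numFolds <= data.size")
    val indexed = data.zipWithIndex
    val metrics = Array.fill(params.size)(0.0)
    for (fold <- 0 until numFolds) {
      // Every numFolds-th point goes to the held-out split for this fold.
      val (test, train) = indexed.partition { case (_, i) => i % numFolds == fold }
      for ((param, p) <- params.zipWithIndex) {
        metrics(p) += evaluate(train.map(_._1), test.map(_._1), param) / numFolds
      }
    }
    // Lower error is better; the metrics stay attached to the returned model.
    val best = params(metrics.indexOf(metrics.min))
    CvModelSketch(best, metrics)
  }
}
```

A caller can then inspect `avgMetrics` after fitting instead of only receiving the best parameter setting.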

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/leahmcguire/spark saveCVmetrics

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/5915.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5915


commit e0020099abbfd6b968abb5a778518c9cbdac9d59
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-05-05T17:40:09Z

Added stats from cross validation as a val in the cross validation model to 
save them for user access

commit 47728db7cfac995d9417cdf0e16d07391aabd581
Author: Sandy Ryza sa...@cloudera.com
Date:   2015-05-05T19:34:02Z

[SPARK-5888] [MLLIB] Add OneHotEncoder as a Transformer

This patch adds a one hot encoder for categorical features.  Planning to 
add documentation and another test after getting feedback on the approach.

A couple choices made here:
* There's an `includeFirst` option which, if false, creates numCategories - 
1 columns and, if true, creates numCategories columns.  The default is true, 
which is the behavior in scikit-learn.
* The user is expected to pass a `Seq` of category names when instantiating 
a `OneHotEncoder`.  These can be easily gotten from a `StringIndexer`.  The 
names are used for the output column names, which take the form 
colName_categoryName.

Author: Sandy Ryza sa...@cloudera.com

Closes #5500 from sryza/sandy-spark-5888 and squashes the following commits:

f383250 [Sandy Ryza] Infer label names automatically
6e257b9 [Sandy Ryza] Review comments
7c539cf [Sandy Ryza] Vector transformers
1c182dd [Sandy Ryza] SPARK-5888. [MLLIB]. Add OneHotEncoder as a Transformer

commit 489700c809a7c0a836538f3d0bd58bed609e8768
Author: zsxwing zsxw...@gmail.com
Date:   2015-05-05T19:52:16Z

[SPARK-6939] [STREAMING] [WEBUI] Add timeline and histogram graphs for 
streaming statistics

This is the initial work of SPARK-6939. Not yet ready for code review. Here 
are the screenshots:


![graph1](https://cloud.githubusercontent.com/assets/1000778/7165766/465942e0-e3dc-11e4-9b05-c184b09d75dc.png)


![graph2](https://cloud.githubusercontent.com/assets/1000778/7165779/53f13f34-e3dc-11e4-8714-a4a75b7e09ff.png)

TODOs:
- [x] Display more information on mouse hover
- [x] Align the timeline and distribution graphs
- [x] Clean up the codes

Author: zsxwing zsxw...@gmail.com

Closes #5533 from zsxwing/SPARK-6939 and squashes the following commits:

9f7cd19 [zsxwing] Merge branch 'master' into SPARK-6939
deacc3f [zsxwing] Remove unused import
cd03424 [zsxwing] Fix .rat-excludes
70cc87d [zsxwing] Streaming Scheduling Delay = Scheduling Delay
d457277 [zsxwing] Fix UIUtils in BatchPage
b3f303e [zsxwing] Add comments for unclear classes and methods
ff0bff8 [zsxwing] Make InputDStream.name private[streaming]
cc392c5 [zsxwing] Merge branch 'master' into SPARK-6939
e275e23 [zsxwing] Move time related methods to Streaming's UIUtils
d5d86f6 [zsxwing] Fix incorrect lastErrorTime
3be4b7a [zsxwing] Use InputInfo
b50fa32 [zsxwing] Jump to the batch page when clicking a point in the 
timeline graphs
203605d [zsxwing] Merge branch 'master' into SPARK-6939
74307cf [zsxwing] Reuse the data for histogram graphs to reduce the page 
size
2586916 [zsxwing] Merge branch 'master' into SPARK-6939
70d8533 [zsxwing] Remove BatchInfo.numRecords and a few renames
7bbdc0a [zsxwing] Hide the receiver sub table if no receiver
a2972e9 [zsxwing] Add some ui tests for StreamingPage
fd03ad0 [zsxwing] Add a test to verify no memory leak
4a8f886 [zsxwing] Merge branch 'master' into SPARK-6939
18607a1 [zsxwing] Merge branch 'master' into SPARK-6939
d0b0aec [zsxwing] Clean up the codes
a459f49 [zsxwing] Add a dash line to processing time graphs
8e4363c [zsxwing] Prepare for the demo
c81a1ee [zsxwing] Change time unit in the graphs automatically
4c0b43f [zsxwing] Update Streaming UI
04c7500 [zsxwing] Make the server and client use the same timezone
fed8219 [zsxwing] Move the x axis at the top and show a better tooltip
c23ce10 [zsxwing] Make two graphs close
d78672a [zsxwing] Make the X axis use the same range
881c907 [zsxwing] Use histogram for distribution
5688702 [zsxwing] Fix the unit test
ddf741a [zsxwing] Fix the unit test
ad93295 [zsxwing] Remove unnecessary codes
a0458f9 [zsxwing] Clean the codes
b82ed1e [zsxwing] Update the graphs as per comments
dd653a1

[GitHub] spark pull request: [SPARK-6164] [ML] CrossValidatorModel should k...

2015-05-07 Thread leahmcguire
Github user leahmcguire closed the pull request at:

https://github.com/apache/spark/pull/5915





[GitHub] spark pull request: [SPARK-6164] [ML] CrossValidatorModel should k...

2015-05-07 Thread leahmcguire
Github user leahmcguire commented on the pull request:

https://github.com/apache/spark/pull/5915#issuecomment-100051023
  
Fixed 





[GitHub] spark pull request: [SPARK-6164 [ML] CrossValidatorModel should ke...

2015-05-05 Thread leahmcguire
GitHub user leahmcguire opened a pull request:

https://github.com/apache/spark/pull/5911

[SPARK-6164 [ML] CrossValidatorModel should keep stats from fitting

Added stats from cross validation as a val in the cross validation model to 
save them for user access.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/leahmcguire/spark saveCVmetrics

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/5911.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5911


commit ce73c63e8bac40b02ae0a8147c3b424783f6094a
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-01-16T16:06:06Z

added Bernoulli option to naive Bayes model in mllib, added optional model 
type parameter for training. When Bernoulli is given, the Bernoulli smoothing is 
used for fitting and for prediction 
http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html

commit 4a3676d8d7e8c30778f95e9f479d97b4b1651ce4
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-01-21T00:19:14Z

Updated changes re-comments. Got rid of verbose populateMatrix method. 
Public api now has string instead of enumeration. Docs are updated.

commit 0313c0cbf8d41b9bcfb0536df253f6af0f1398f7
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-01-21T17:43:00Z

fixed style error in NaiveBayes.scala

commit 76e5b0f90e370e2cda20e1348bf40ff890f51782
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-01-26T18:29:47Z

removed unnecessary sort from test

commit d9477ed8450594de9f2da24af8f82c82def5ce24
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-02-26T17:16:12Z

removed old inaccurate comment from test suite for mllib naive bayes

commit 3891bf2f708bda712028551334960d2cc66af536
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-02-27T05:34:01Z

synced with apache spark and resolved merge conflict

commit 5a4a534d3636100546b5fa86d2d7ec2ed2051582
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-02-27T16:56:24Z

fixed scala style error in NaiveBayes

commit b61b5e2d91582689642fb045849df62a16ce111c
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-03-02T18:50:18Z

added backward-compatible constructor to NaiveBayesModel to fix MiMa test 
failure

commit 37305729334922c40804752598a30a2fb892c317
Author: Joseph K. Bradley jos...@databricks.com
Date:   2015-03-03T23:22:20Z

modified NB model type to be more Java-friendly

commit b93aaf682572890c49a58da149612c0053afc3de
Author: Leah McGuire lmcgu...@salesforce.com
Date:   2015-03-05T19:03:33Z

Merge pull request #1 from jkbradley/nb-model-type

modified NB model type to be more Java-friendly

commit 7622b0c002c12efd8fb2c6fa34a691c82c86edd8
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-03-05T19:07:25Z

added comments and fixed style as per rb

commit dc65374b4c7933700ffa4e3f572ec44ece382a05
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-03-05T19:24:50Z

integrated model type fix

commit 85f298f251f757772294ea68988522a5c26a19ac
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-03-05T19:25:34Z

Merge remote-tracking branch 'upstream/master'

commit e01656978174f8ecbd75ef6a50211234a1babfc6
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-03-05T19:28:05Z

updated test suite with model type fix

commit ea09b28c908e86f8ebc7bbb3e98bfe83cc636b78
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-03-05T19:48:41Z

Merge remote-tracking branch 'upstream/master'

commit 900b5864c16cc0db93a46ec3a4591a787e5a21a0
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-03-05T19:53:46Z

fixed model call so that uses type argument

commit b85b0c9e602770702a477cc36c7d72e2410c5139
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-03-05T21:04:02Z

Merge remote-tracking branch 'upstream/master'

commit c298e78ba7d58bb4d7e9b54d56ce51fe6b6b10a9
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-03-05T21:16:08Z

fixed scala style errors

commit 2d0c1ba631841a0c55212fbc8dd7327285972ef8
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-03-05T21:42:42Z

fixed typo in NaiveBayes

commit e2d925eb088f7cabb38024ecb7b0628557d261ba
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-03-07T01:26:17Z

fixed nonserializable error that was causing naivebayes test failures

commit fb0a5c70ce935cb8d9495152c809e06c8f350443
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-03-09T20:36:36Z

removed typo

commit 01baad70f44fa12ad37a743d5d0fba861d89f149
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-03-11T22:44:22Z

made fixes from code review

commit bea62af37fdf389474474d80fdac3c94f6a8808f
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-03-12T18:10:16Z

put back in constructor for NaiveBayes

commit

[GitHub] spark pull request: [SPARK-6164 [ML] CrossValidatorModel should ke...

2015-05-05 Thread leahmcguire
Github user leahmcguire closed the pull request at:

https://github.com/apache/spark/pull/5911





[GitHub] spark pull request: [SPARK-6164] [ML] CrossValidatorModel should k...

2015-05-05 Thread leahmcguire
GitHub user leahmcguire opened a pull request:

https://github.com/apache/spark/pull/5915

[SPARK-6164] [ML] CrossValidatorModel should keep stats from fitting

Added stats from cross validation as a val in the cross validation model to 
save them for user access.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/leahmcguire/spark saveCVmetrics

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/5915.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5915


commit e0020099abbfd6b968abb5a778518c9cbdac9d59
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-05-05T17:40:09Z

Added stats from cross validation as a val in the cross validation model to 
save them for user access







[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...

2015-03-25 Thread leahmcguire
Github user leahmcguire commented on the pull request:

https://github.com/apache/spark/pull/4087#issuecomment-86292839
  
Either version is fine. If you have time to make the change tomorrow, go
ahead and send the PR. Otherwise I'll have time to make the change on
Friday.

On Wed, Mar 25, 2015 at 12:41 PM, jkbradley notificati...@github.com
wrote:

 (I was about to merge this, but then this issue came up.) After that
 adjustment, it should be fine. (And feel free to make this change yourself,
 but I'm offering to do it since the dev list discussion keeps going back
 and forth.)

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/4087#issuecomment-86187804.







[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...

2015-03-16 Thread leahmcguire
Github user leahmcguire commented on a diff in the pull request:

https://github.com/apache/spark/pull/4087#discussion_r26542594
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -156,9 +181,14 @@ object NaiveBayesModel extends Loader[NaiveBayesModel] {
  * document classification.  By making every vector a 0-1 vector, it can also be used as
  * Bernoulli NB ([[http://tinyurl.com/p7c96j6]]). The input feature values must be nonnegative.
  */
-class NaiveBayes private (private var lambda: Double) extends Serializable with Logging {
 
-  def this() = this(1.0)
+class NaiveBayes private (
+    private var lambda: Double,
+    private var modelType: NaiveBayes.ModelType) extends Serializable with Logging {
+
+  def this(lambda: Double) = this(lambda, NaiveBayes.Multinomial)
--- End diff --

Nope. Before just adding it back, I tried adding it back as private, and it 
still failed.





[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...

2015-03-16 Thread leahmcguire
Github user leahmcguire commented on a diff in the pull request:

https://github.com/apache/spark/pull/4087#discussion_r26543828
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -35,26 +39,30 @@ import org.apache.spark.sql.{DataFrame, SQLContext}
  * @param pi log of class priors, whose dimension is C, number of labels
  * @param theta log of class conditional probabilities, whose dimension is C-by-D,
  *  where D is number of features
+ * @param modelType The type of NB model to fit from the enumeration NaiveBayesModels, can be
+ *  Multinomial or Bernoulli
  */
 class NaiveBayesModel private[mllib] (
     val labels: Array[Double],
     val pi: Array[Double],
-    val theta: Array[Array[Double]]) extends ClassificationModel with Serializable with Saveable {
+    val theta: Array[Array[Double]],
+    val modelType: String)
--- End diff --

Yep that fixes it :P





[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...

2015-03-12 Thread leahmcguire
Github user leahmcguire commented on a diff in the pull request:

https://github.com/apache/spark/pull/4087#discussion_r26347821
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -156,9 +181,14 @@ object NaiveBayesModel extends Loader[NaiveBayesModel] {
  * document classification.  By making every vector a 0-1 vector, it can also be used as
  * Bernoulli NB ([[http://tinyurl.com/p7c96j6]]). The input feature values must be nonnegative.
  */
-class NaiveBayes private (private var lambda: Double) extends Serializable with Logging {
 
-  def this() = this(1.0)
+class NaiveBayes private (
+    private var lambda: Double,
+    private var modelType: NaiveBayes.ModelType) extends Serializable with Logging {
+
+  def this(lambda: Double) = this(lambda, NaiveBayes.Multinomial)
--- End diff --

Removing this causes MiMa test failures.
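
For readers unfamiliar with the pattern, here is a minimal sketch of keeping an older constructor signature alongside a new, wider one. It is not the Spark class: `ConstructorCompatSketch` and `NaiveBayesSketch` are hypothetical names, and the model type is simplified to a `String`.

```scala
object ConstructorCompatSketch {
  // Hypothetical stand-in for the class under review.
  class NaiveBayesSketch private (val lambda: Double, val modelType: String) {
    // Keeps the one-argument signature available and fills in a default model type.
    def this(lambda: Double) = this(lambda, "Multinomial")
    // No-argument form delegates to the one-argument constructor above.
    def this() = this(1.0)
  }
}
```

With the one-argument constructor in place, code that was written against the narrower signature still resolves, which is the property a binary-compatibility check looks for.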





[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...

2015-03-11 Thread leahmcguire
Github user leahmcguire commented on a diff in the pull request:

https://github.com/apache/spark/pull/4087#discussion_r26256579
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -35,26 +39,30 @@ import org.apache.spark.sql.{DataFrame, SQLContext}
  * @param pi log of class priors, whose dimension is C, number of labels
  * @param theta log of class conditional probabilities, whose dimension is C-by-D,
  *  where D is number of features
+ * @param modelType The type of NB model to fit from the enumeration NaiveBayesModels, can be
+ *  Multinomial or Bernoulli
  */
 class NaiveBayesModel private[mllib] (
     val labels: Array[Double],
     val pi: Array[Double],
-    val theta: Array[Array[Double]]) extends ClassificationModel with Serializable with Saveable {
+    val theta: Array[Array[Double]],
+    val modelType: String)
--- End diff --

I had to change this from the enum-like type to a string to fix the unit 
test failures. An actual enum worked, but the substitute that you suggested was 
throwing a non-serializable error on all of the NaiveBayes tests.
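
The thread does not show which substitute type was tried, so the following is only a general illustration of how an enum-like value can fail Java serialization while a string or a case object does not. All names here (`SerializationSketch`, `PlainModelType`, `ModelType`, `canSerialize`) are hypothetical.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object SerializationSketch {
  // A plain class does not implement java.io.Serializable.
  final class PlainModelType(val name: String)

  // Case objects are serializable; the trait is marked Serializable as well for clarity.
  sealed trait ModelType extends Serializable
  case object Multinomial extends ModelType
  case object Bernoulli extends ModelType

  /** Returns true if Java serialization accepts the value, false if it rejects it. */
  def canSerialize(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(value)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }
}
```

A model that holds a value like `PlainModelType` as a field would be rejected in the same way whenever the model itself is serialized, whereas a `String` field is always safe.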





[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...

2015-03-11 Thread leahmcguire
Github user leahmcguire commented on a diff in the pull request:

https://github.com/apache/spark/pull/4087#discussion_r26258688
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -262,4 +303,58 @@ object NaiveBayes {
   def train(input: RDD[LabeledPoint], lambda: Double): NaiveBayesModel = {
     new NaiveBayes(lambda).run(input)
   }
+
+
+  /**
+   * Trains a Naive Bayes model given an RDD of `(label, features)` pairs.
+   *
+   * The model type can be set to either Multinomial NB ([[http://tinyurl.com/lsdw6p]])
+   * or Bernoulli NB ([[http://tinyurl.com/p7c96j6]]). The Multinomial NB can handle
+   * discrete count data and can be called by setting the model type to multinomial.
+   * For example, it can be used with word counts or TF_IDF vectors of documents.
+   * The Bernoulli model fits presence or absence (0-1) counts. By making every vector a
+   * 0-1 vector and setting the model type to bernoulli, the  fits and predicts as
+   * Bernoulli NB.
+   *
+   * @param input RDD of `(label, array of features)` pairs.  Every vector should be a frequency
+   *  vector or a count vector.
+   * @param lambda The smoothing parameter
+   *
+   * @param modelType The type of NB model to fit from the enumeration NaiveBayesModels, can be
+   *  multinomial or bernoulli
+   */
+  def train(input: RDD[LabeledPoint], lambda: Double, modelType: String): NaiveBayesModel = {
--- End diff --

If we remove this static train method, should we also remove the static 
train method that just includes lambda (line 326)? Otherwise the train calls 
are inconsistent for setting different model parameters (lambda and modelType).





[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...

2015-03-11 Thread leahmcguire
Github user leahmcguire commented on the pull request:

https://github.com/apache/spark/pull/4087#issuecomment-78381145
  
@jkbradley thanks for the comments! I have implemented everything except 
the two inline comments that I replied to directly. 
I'm not clear about how you want the versioning implemented for 
save/load, so it may be simpler for you to just push a PR to me. 





[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...

2015-03-05 Thread leahmcguire
Github user leahmcguire commented on the pull request:

https://github.com/apache/spark/pull/4087#issuecomment-77435497
  
I made all the inline fixes and integrated the model type fix. If you can 
provide me with a bit more guidance on the save/load, I am happy to do it.  





[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...

2015-02-26 Thread leahmcguire
Github user leahmcguire commented on a diff in the pull request:

https://github.com/apache/spark/pull/4087#discussion_r25443491
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/classification/NaiveBayesSuite.scala
 ---
@@ -71,23 +86,67 @@ class NaiveBayesSuite extends FunSuite with 
MLlibTestSparkContext {
 assert(numOfPredictions  input.length / 5)
   }
 
-  test(Naive Bayes) {
+  def validateModelFit(piData: Array[Double], thetaData: 
Array[Array[Double]], model: NaiveBayesModel) = {
+def closeFit(d1: Double, d2: Double, precision: Double): Boolean = {
+  (d1 - d2).abs = precision
+}
+val modelIndex = (0 until piData.length).zip(model.labels.map(_.toInt))
+for (i - modelIndex) {
+  assert(closeFit(math.exp(piData(i._2)), math.exp(model.pi(i._1)), 
0.05))
+}
+for (i - modelIndex) {
+  for (j - 0 until thetaData(i._2).length) {
+assert(closeFit(math.exp(thetaData(i._2)(j)), 
math.exp(model.theta(i._1)(j)), 0.05))
+  }
+}
+  }
+
+  test("Naive Bayes Multinomial") {
+    val nPoints = 1000
+
+    val pi = Array(0.5, 0.1, 0.4).map(math.log)
+    val theta = Array(
+      Array(0.70, 0.10, 0.10, 0.10), // label 0
+      Array(0.10, 0.70, 0.10, 0.10), // label 1
+      Array(0.10, 0.10, 0.70, 0.10)  // label 2
+    ).map(_.map(math.log))
+
+    val testData = NaiveBayesSuite.generateNaiveBayesInput(pi, theta, nPoints, 42, NaiveBayesModels.Multinomial)
+    val testRDD = sc.parallelize(testData, 2)
+    testRDD.cache()
+
+    val model = NaiveBayes.train(testRDD, 1.0, Multinomial)
+    validateModelFit(pi, theta, model)
+
+    val validationData = NaiveBayesSuite.generateNaiveBayesInput(pi, theta, nPoints, 17, NaiveBayesModels.Multinomial)
+    val validationRDD = sc.parallelize(validationData, 2)
+
+    // Test prediction on RDD.
+    validatePrediction(model.predict(validationRDD.map(_.features)).collect(), validationData)
+
+    // Test prediction on Array.
+    validatePrediction(validationData.map(row => model.predict(row.features)), validationData)
+  }
+
+  test("Naive Bayes Bernoulli") {
     val nPoints = 1
 
     val pi = Array(0.5, 0.3, 0.2).map(math.log)
     val theta = Array(
-      Array(0.91, 0.03, 0.03, 0.03), // label 0
-      Array(0.03, 0.91, 0.03, 0.03), // label 1
-      Array(0.03, 0.03, 0.91, 0.03)  // label 2
+      Array(0.50, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.40), // label 0
+      Array(0.02, 0.70, 0.10, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02), // label 1
+      Array(0.02, 0.02, 0.60, 0.02,  0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.30)  // label 2
     ).map(_.map(math.log))
 
-    val testData = NaiveBayesSuite.generateNaiveBayesInput(pi, theta, nPoints, 42)
+
+    val testData = NaiveBayesSuite.generateNaiveBayesInput(pi, theta, nPoints, 45, NaiveBayesModels.Bernoulli)
     val testRDD = sc.parallelize(testData, 2)
     testRDD.cache()
 
-    val model = NaiveBayes.train(testRDD)
+    val model = NaiveBayes.train(testRDD, 1.0, Bernoulli) ///!!! this gives same result on both models check the math
--- End diff --

No, this was resolved before the commit. I just forgot to remove the comment.





[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...

2015-01-19 Thread leahmcguire
Github user leahmcguire commented on the pull request:

https://github.com/apache/spark/pull/4087#issuecomment-70597399
  
Thanks for the comments! 

The JIRA for the Python API is:
https://issues.apache.org/jira/browse/SPARK-5328

I will get the rest fixed tonight or tomorrow.





[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...

2015-01-17 Thread leahmcguire
GitHub user leahmcguire opened a pull request:

https://github.com/apache/spark/pull/4087

[SPARK-4894][mllib] Added Bernoulli option to NaiveBayes model in mllib

Added an optional model type parameter for NaiveBayes training. It can be 
either Multinomial or Bernoulli. 

When Bernoulli is given, Bernoulli smoothing is used for fitting and for 
prediction, as described in: 
http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html

The default model type is Multinomial, which keeps the original fit and 
predict behavior.

Added additional testing for Bernoulli and Multinomial models.
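
For readers unfamiliar with the Bernoulli variant, here is a minimal, 
Spark-free sketch of the fit and predict steps described on the IR-book page 
linked above. It is not the code in this pull request. The names 
BernoulliSketch, Model, fit and predict are hypothetical, features are assumed 
to be binary (any nonzero value counts as "present"), and the class prior is 
left unsmoothed as in the book, which may differ from what MLlib does.

```scala
// Illustrative Bernoulli naive Bayes; all names are hypothetical, not MLlib API.
object BernoulliSketch {
  // logPrior(c) = log P(c); logTheta(c)(t) = log P(feature t present | class c).
  final case class Model(
      labels: Array[Int],
      logPrior: Array[Double],
      logTheta: Array[Array[Double]])

  // Fit with additive smoothing: P(t | c) = (N_ct + lambda) / (N_c + 2 * lambda).
  def fit(data: Seq[(Int, Array[Double])], lambda: Double = 1.0): Model = {
    val numFeatures = data.head._2.length
    val n = data.size.toDouble
    val byLabel = data.groupBy(_._1).toSeq.sortBy(_._1)
    val labels = byLabel.map(_._1).toArray
    val logPrior = byLabel.map { case (_, rows) => math.log(rows.size / n) }.toArray
    val logTheta = byLabel.map { case (_, rows) =>
      Array.tabulate(numFeatures) { t =>
        // Number of rows in this class where feature t is present.
        val present = rows.count { case (_, features) => features(t) != 0.0 }
        math.log((present + lambda) / (rows.size + 2.0 * lambda))
      }
    }.toArray
    Model(labels, logPrior, logTheta)
  }

  // Unlike the multinomial model, absent features also contribute: log(1 - P(t | c)).
  def predict(model: Model, x: Array[Double]): Int = {
    val scores = model.labels.indices.map { c =>
      val logLikelihood = x.indices.map { t =>
        val logP = model.logTheta(c)(t)
        if (x(t) != 0.0) logP else math.log1p(-math.exp(logP))
      }.sum
      model.logPrior(c) + logLikelihood
    }
    val best = scores.indices.maxBy(c => scores(c))
    model.labels(best)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      (0, Array(1.0, 0.0, 0.0)),
      (0, Array(1.0, 1.0, 0.0)),
      (1, Array(0.0, 0.0, 1.0)),
      (1, Array(0.0, 1.0, 1.0)))
    val model = fit(data)
    println(predict(model, Array(1.0, 0.0, 0.0)))
    println(predict(model, Array(0.0, 0.0, 1.0)))
  }
}
```

On the four toy rows in main, this prints 0 and then 1. The absent-feature 
term in predict is the part that distinguishes Bernoulli scoring from 
multinomial scoring.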

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/leahmcguire/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4087.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4087


commit ce73c63e8bac40b02ae0a8147c3b424783f6094a
Author: leahmcguire lmcgu...@salesforce.com
Date:   2015-01-16T16:06:06Z

Added Bernoulli option to naive Bayes model in mllib; added optional model 
type parameter for training. When Bernoulli is given, Bernoulli smoothing is 
used for fitting and for prediction: 
http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html



