[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2015-02-03 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302952#comment-14302952
 ] 

yuhao yang commented on SPARK-1405:
---

Hi everyone, I'm sharing an implementation of [Online 
LDA|https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf] at 
https://github.com/hhbyyh/OnlineLDA_Spark, and I hope it can be helpful to 
anyone interested.

The work is based on the research of [Matt 
Hoffman|http://www.cs.princeton.edu/~mdhoffma/] and [David M. 
Blei|http://www.cs.princeton.edu/~blei/topicmodeling.html]. Owing to its online 
nature, the algorithm 
1. scans the corpus (document set) only once. Thus it {quote}need not locally 
store or collect the documents and can be handily applied to streaming document 
collections.{quote}
2. breaks the massive corpus into mini-batches and takes one batch at a time, 
which reduces memory and time consumption.
3. approximates the posterior as well as traditional approaches do (generates 
comparable or better results).

In demo runs, the current implementation (with many details still to be improved)
1. processed 8 million short articles (Stack Overflow post titles, avg length 
9, K=10) in 15 minutes.
2. processed the entire English wiki dump (5876K documents, avg length ~900 
words per doc, 30G on disk, K=10) in 2 hours and 17 minutes 
using a 4-node cluster (20G memory, which could be much less).

Trials and suggestions are most welcome!
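
For the curious, the core of that online update fits in a few lines. Below is a 
minimal sketch in Scala (illustrative names only, not the OnlineLDA_Spark API): 
each mini-batch's variational E-step produces an estimate lambdaHat of the 
topic-word parameter, which is blended into the global lambda with a decaying 
weight rho_t = (tau0 + t)^(-kappa), as in Hoffman et al.

  import breeze.linalg.{DenseMatrix => BDM}

  // Minimal sketch of online variational Bayes for LDA (Hoffman et al., 2010).
  // The per-batch E-step that produces lambdaHat is omitted here.
  def onlineUpdate(batches: Iterator[BDM[Double]], k: Int, vocabSize: Int,
                   tau0: Double = 1024.0, kappa: Double = 0.51): BDM[Double] = {
    var lambda = BDM.rand[Double](k, vocabSize) // global topic-word parameter
    var t = 0
    for (lambdaHat <- batches) {
      val rho = math.pow(tau0 + t, -kappa)      // decaying step size rho_t
      lambda = lambda * (1.0 - rho) + lambdaHat * rho
      t += 1
    }
    lambda
  }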

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Joseph K. Bradley
Priority: Critical
  Labels: features
 Fix For: 1.3.0

 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (already solved), a word segmentation (imported from 
 Lucene), and a Gibbs sampling core.






[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2015-02-03 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302952#comment-14302952
 ] 

yuhao yang edited comment on SPARK-1405 at 2/3/15 8:35 AM:
---

Hi everyone, I'm sharing an implementation of [Online 
LDA|https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf] at 
https://github.com/hhbyyh/OnlineLDA_Spark, and I hope it can be helpful to 
anyone interested.

The work is based on the research of [Matt 
Hoffman|http://www.cs.princeton.edu/~mdhoffma/] and [David M. 
Blei|http://www.cs.princeton.edu/~blei/topicmodeling.html]. Owing to its online 
nature, the algorithm 
1. scans the corpus (document set) only once. Thus it need not locally store or 
collect the documents and can be handily applied to streaming document 
collections.
2. breaks the massive corpus into mini-batches and takes one batch at a time, 
which reduces memory and time consumption.
3. approximates the posterior as well as traditional approaches do (generates 
comparable or better results).

In demo runs, the current implementation (with many details still to be improved)
1. processed 8 million short articles (Stack Overflow post titles, avg length 
9, K=10) in 15 minutes.
2. processed the entire English wiki dump (5876K documents, avg length ~900 
words per doc, 30G on disk, K=10) in 2 hours and 17 minutes 
using a 4-node cluster (20G memory, which could be much less).

Trials and suggestions are most welcome!


was (Author: yuhaoyan):
Hi everyone, I'm sharing an implementation of [Online 
LDA|https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf] at 
https://github.com/hhbyyh/OnlineLDA_Spark, and I hope it can be helpful to 
anyone interested.

The work is based on the research of [Matt 
Hoffman|http://www.cs.princeton.edu/~mdhoffma/] and [David M. 
Blei|http://www.cs.princeton.edu/~blei/topicmodeling.html]. Owing to its online 
nature, the algorithm 
1. scans the corpus (document set) only once. Thus it {quote}need not locally 
store or collect the documents and can be handily applied to streaming document 
collections.{quote}
2. breaks the massive corpus into mini-batches and takes one batch at a time, 
which reduces memory and time consumption.
3. approximates the posterior as well as traditional approaches do (generates 
comparable or better results).

In demo runs, the current implementation (with many details still to be improved)
1. processed 8 million short articles (Stack Overflow post titles, avg length 
9, K=10) in 15 minutes.
2. processed the entire English wiki dump (5876K documents, avg length ~900 
words per doc, 30G on disk, K=10) in 2 hours and 17 minutes 
using a 4-node cluster (20G memory, which could be much less).

Trials and suggestions are most welcome!

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Joseph K. Bradley
Priority: Critical
  Labels: features
 Fix For: 1.3.0

 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (already solved), a word segmentation (imported from 
 Lucene), and a Gibbs sampling core.






[jira] [Commented] (SPARK-5566) Tokenizer for mllib package

2015-02-05 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308733#comment-14308733
 ] 

yuhao yang commented on SPARK-5566:
---

I mean only the underlying implementation. 

 Tokenizer for mllib package
 ---

 Key: SPARK-5566
 URL: https://issues.apache.org/jira/browse/SPARK-5566
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 There exist tokenizer classes in the spark.ml.feature package and in the 
 LDAExample in the spark.examples.mllib package.  The Tokenizer in the 
 LDAExample is more advanced and should be made into a full-fledged public 
 class in spark.mllib.feature.  The spark.ml.feature.Tokenizer class should 
 become a wrapper around the new Tokenizer.






[jira] [Comment Edited] (SPARK-5563) LDA with online variational inference

2015-02-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305115#comment-14305115
 ] 

yuhao yang edited comment on SPARK-5563 at 2/4/15 2:22 PM:
---

Thanks Joseph for helping create the jira.
Paste previous [comment 
link|https://issues.apache.org/jira/browse/SPARK-1405?focusedCommentId=14302952&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14302952]
 here and share the current implementation at 
https://github.com/hhbyyh/OnlineLDA_Spark.

I agree with the suggestion listed above and will propose a PR for more 
detailed discussion soon. Thanks.


was (Author: yuhaoyan):
Thanks Joseph for helping create the jira.
Paste previous [comment 
link|https://issues.apache.org/jira/browse/SPARK-1405?focusedCommentId=14302952&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14302952]
 here and share the current implementation at 
https://github.com/hhbyyh/OnlineLDA_Spark.

I agree with the suggestion listed above and will propose a PR for more 
detailed discussion soon. Thanks


 LDA with online variational inference
 -

 Key: SPARK-5563
 URL: https://issues.apache.org/jira/browse/SPARK-5563
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) parameters can be inferred using online 
 variational inference, as in Hoffman, Blei and Bach. “Online Learning for 
 Latent Dirichlet Allocation.”  NIPS, 2010.  This algorithm should be very 
 efficient and should be able to handle much larger datasets than batch 
 algorithms for LDA.
 This algorithm will also be important for supporting Streaming versions of 
 LDA.
 The implementation will ideally use the same API as the existing LDA but use 
 a different underlying optimizer.
 This will require hooking in to the existing mllib.optimization frameworks.
 This will require some discussion about whether batch versions of online 
 variational inference should be supported, as well as what variational 
 approximation should be used now or in the future.






[jira] [Commented] (SPARK-5563) LDA with online variational inference

2015-02-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305115#comment-14305115
 ] 

yuhao yang commented on SPARK-5563:
---

Thanks Joseph for helping create the jira.
Paste previous [comment 
link|https://issues.apache.org/jira/browse/SPARK-1405?focusedCommentId=14302952&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14302952]
 here and share the current implementation at 
https://github.com/hhbyyh/OnlineLDA_Spark.

I agree with the suggestion listed above and will propose a PR for more 
detailed discussion soon. Thanks


 LDA with online variational inference
 -

 Key: SPARK-5563
 URL: https://issues.apache.org/jira/browse/SPARK-5563
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) parameters can be inferred using online 
 variational inference, as in Hoffman, Blei and Bach. “Online Learning for 
 Latent Dirichlet Allocation.”  NIPS, 2010.  This algorithm should be very 
 efficient and should be able to handle much larger datasets than batch 
 algorithms for LDA.
 This algorithm will also be important for supporting Streaming versions of 
 LDA.
 The implementation will ideally use the same API as the existing LDA but use 
 a different underlying optimizer.
 This will require hooking in to the existing mllib.optimization frameworks.
 This will require some discussion about whether batch versions of online 
 variational inference should be supported, as well as what variational 
 approximation should be used now or in the future.






[jira] [Comment Edited] (SPARK-5563) LDA with online variational inference

2015-02-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305115#comment-14305115
 ] 

yuhao yang edited comment on SPARK-5563 at 2/4/15 2:23 PM:
---

Thanks Joseph for helping create the jira.
Paste previous [comment 
link|https://issues.apache.org/jira/browse/SPARK-1405?focusedCommentId=14302952&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14302952]
 here and share the current implementation at 
https://github.com/hhbyyh/OnlineLDA_Spark.

I agree with the suggestion listed above and will propose a PR for more 
detailed discussion soon (ETA tomorrow). Thanks.


was (Author: yuhaoyan):
Thanks Joseph for helping create the jira.
Paste previous [comment 
link|https://issues.apache.org/jira/browse/SPARK-1405?focusedCommentId=14302952&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14302952]
 here and share the current implementation at 
https://github.com/hhbyyh/OnlineLDA_Spark.

I agree with the suggestion listed above and will propose a PR for more 
detailed discussion soon. Thanks.

 LDA with online variational inference
 -

 Key: SPARK-5563
 URL: https://issues.apache.org/jira/browse/SPARK-5563
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) parameters can be inferred using online 
 variational inference, as in Hoffman, Blei and Bach. “Online Learning for 
 Latent Dirichlet Allocation.”  NIPS, 2010.  This algorithm should be very 
 efficient and should be able to handle much larger datasets than batch 
 algorithms for LDA.
 This algorithm will also be important for supporting Streaming versions of 
 LDA.
 The implementation will ideally use the same API as the existing LDA but use 
 a different underlying optimizer.
 This will require hooking in to the existing mllib.optimization frameworks.
 This will require some discussion about whether batch versions of online 
 variational inference should be supported, as well as what variational 
 approximation should be used now or in the future.






[jira] [Commented] (SPARK-5563) LDA with online variational inference

2015-02-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305199#comment-14305199
 ] 

yuhao yang commented on SPARK-5563:
---

BTW, a batch version of online variational inference is useful when processing 
small data sets (especially toy data in unit tests).

 LDA with online variational inference
 -

 Key: SPARK-5563
 URL: https://issues.apache.org/jira/browse/SPARK-5563
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) parameters can be inferred using online 
 variational inference, as in Hoffman, Blei and Bach. “Online Learning for 
 Latent Dirichlet Allocation.”  NIPS, 2010.  This algorithm should be very 
 efficient and should be able to handle much larger datasets than batch 
 algorithms for LDA.
 This algorithm will also be important for supporting Streaming versions of 
 LDA.
 The implementation will ideally use the same API as the existing LDA but use 
 a different underlying optimizer.
 This will require hooking in to the existing mllib.optimization frameworks.
 This will require some discussion about whether batch versions of online 
 variational inference should be supported, as well as what variational 
 approximation should be used now or in the future.






[jira] [Commented] (SPARK-5566) Tokenizer for mllib package

2015-02-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305172#comment-14305172
 ] 

yuhao yang commented on SPARK-5566:
---

Actually, I believe much existing code, like Word2Vec and HashingTF, shares a 
similar data flow, and it would be best if we took the common requirements into 
consideration. 

 Tokenizer for mllib package
 ---

 Key: SPARK-5566
 URL: https://issues.apache.org/jira/browse/SPARK-5566
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 There exist tokenizer classes in the spark.ml.feature package and in the 
 LDAExample in the spark.examples.mllib package.  The Tokenizer in the 
 LDAExample is more advanced and should be made into a full-fledged public 
 class in spark.mllib.feature.  The spark.ml.feature.Tokenizer class should 
 become a wrapper around the new Tokenizer.






[jira] [Closed] (SPARK-5282) RowMatrix easily gets int overflow in the memory size warning

2015-01-19 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang closed SPARK-5282.
-

fixed

 RowMatrix easily gets int overflow in the memory size warning
 -

 Key: SPARK-5282
 URL: https://issues.apache.org/jira/browse/SPARK-5282
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
 Environment: centos, others should be similar
Reporter: yuhao yang
Assignee: yuhao yang
Priority: Trivial
 Fix For: 1.3.0, 1.2.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 The warning in RowMatrix easily hits an int overflow when the number of 
 columns is larger than 16385. 
 Minor issue.






[jira] [Commented] (SPARK-5186) Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size

2015-01-16 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280025#comment-14280025
 ] 

yuhao yang commented on SPARK-5186:
---

I just updated the PR with a hashCode fix. Please help review at will.

 Vector.equals  and Vector.hashCode are very inefficient and fail on 
 SparseVectors with large size
 -

 Key: SPARK-5186
 URL: https://issues.apache.org/jira/browse/SPARK-5186
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Derrick Burns
   Original Estimate: 0.25h
  Remaining Estimate: 0.25h

 The implementations of Vector.equals and Vector.hashCode are correct but slow 
 for SparseVectors that are truly sparse.
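
 A minimal sketch of the direction such a fix can take (illustrative, not 
 necessarily the merged patch): hash only the active entries, so the cost is 
 O(nnz) rather than O(size), and skip explicit zeros so that sparse and dense 
 representations of the same vector agree.

   // Illustrative O(nnz) hash over the nonzeros of a sparse vector.
   def sparseHashCode(size: Int, indices: Array[Int], values: Array[Double]): Int = {
     var result = 31 + size
     var i = 0
     while (i < indices.length) {
       if (values(i) != 0.0) {                    // skip explicit zeros
         result = 31 * result + indices(i)
         val bits = java.lang.Double.doubleToLongBits(values(i))
         result = 31 * result + (bits ^ (bits >>> 32)).toInt
       }
       i += 1
     }
     result
   }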






[jira] [Closed] (SPARK-5234) examples for ml don't have sparkContext.stop

2015-01-16 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang closed SPARK-5234.
-

fixed

 examples for ml don't have sparkContext.stop
 

 Key: SPARK-5234
 URL: https://issues.apache.org/jira/browse/SPARK-5234
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.2.0
 Environment: all
Reporter: yuhao yang
Assignee: yuhao yang
Priority: Trivial
 Fix For: 1.3.0, 1.2.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 Not sure why sc.stop() is not in the 
 org.apache.spark.examples.ml {CrossValidatorExample, SimpleParamsExample, 
 SimpleTextClassificationPipeline}. 
 I can prepare a PR if it's not intentional to omit the call to stop.
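
 For reference, the change amounts to ending each example with a stop() call; 
 a minimal sketch of the pattern (the try/finally is an optional extra for 
 safety):

   import org.apache.spark.{SparkConf, SparkContext}

   object ExampleSkeleton {
     def main(args: Array[String]): Unit = {
       val sc = new SparkContext(new SparkConf().setAppName("Example"))
       try {
         // ... example body ...
       } finally {
         sc.stop() // shut down cleanly and release cluster resources
       }
     }
   }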






[jira] [Created] (SPARK-5282) RowMatrix easily gets int overflow in the memory size warning

2015-01-16 Thread yuhao yang (JIRA)
yuhao yang created SPARK-5282:
-

 Summary: RowMatrix easily gets int overflow in the memory size 
warning
 Key: SPARK-5282
 URL: https://issues.apache.org/jira/browse/SPARK-5282
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
 Environment: centos, others should be similar
Reporter: yuhao yang
Priority: Trivial


The warning in RowMatrix easily hits an int overflow when the number of columns 
is larger than 16385.

Minor issue.






[jira] [Commented] (SPARK-5282) RowMatrix easily gets int overflow in the memory size warning

2015-01-16 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280159#comment-14280159
 ] 

yuhao yang commented on SPARK-5282:
---

Typical wrong message: Row matrix: 17000 columns will require at least 
-1982967296 bytes of memory!

PR on the way.
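
The arithmetic behind that bogus figure: the estimate is presumably computed in 
Int and wraps around. A quick sketch of the wraparound and the usual fix 
(promote to Long before multiplying):

  val cols = 17000
  // 17000 * 17000 * 8 = 2312000000 exceeds Int.MaxValue (2147483647) and wraps:
  val wrong = cols * cols * 8        // -1982967296, the value in the warning
  // Promoting to Long first gives the intended estimate:
  val right = cols.toLong * cols * 8 // 2312000000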

 RowMatrix easily gets int overflow in the memory size warning
 -

 Key: SPARK-5282
 URL: https://issues.apache.org/jira/browse/SPARK-5282
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
 Environment: centos, others should be similar
Reporter: yuhao yang
Priority: Trivial
   Original Estimate: 1h
  Remaining Estimate: 1h

 The warning in RowMatrix easily hits an int overflow when the number of 
 columns is larger than 16385. 
 Minor issue.






[jira] [Closed] (SPARK-5717) add sc.stop to LDA examples

2015-02-10 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang closed SPARK-5717.
-

merged. Thanks

 add sc.stop to LDA examples
 ---

 Key: SPARK-5717
 URL: https://issues.apache.org/jira/browse/SPARK-5717
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: yuhao yang
Assignee: yuhao yang
Priority: Trivial
 Fix For: 1.3.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 Trivial: add sc.stop() and reorganize imports in LDAExample and JavaLDAExample.






[jira] [Closed] (SPARK-5384) Vectors.sqdist return inconsistent result for sparse/dense vectors when the vectors have different lengths

2015-01-25 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang closed SPARK-5384.
-

fixed

 Vectors.sqdist return inconsistent result for sparse/dense vectors when the 
 vectors have different lengths
 --

 Key: SPARK-5384
 URL: https://issues.apache.org/jira/browse/SPARK-5384
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.1
 Environment: centos, others should be similar
Reporter: yuhao yang
Assignee: yuhao yang
Priority: Critical
 Fix For: 1.3.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 For two vectors of different lengths, Vectors.sqdist returns different 
 results when the vectors are represented as sparse and dense respectively. 
 Sample: 
 val s1 = new SparseVector(4, Array(0,1,2,3), Array(1.0, 2.0, 3.0, 4.0))
 val s2 = new SparseVector(1, Array(0), Array(9.0))
 val d1 = new DenseVector(Array(1.0, 2.0, 3.0, 4.0))
 val d2 = new DenseVector(Array(9.0))
 println(s1 == d1 && s2 == d2)
 println(Vectors.sqdist(s1, s2))
 println(Vectors.sqdist(d1, d2))
 Result:
  true
  93.0
  64.0
 More precisely, Vectors.sqdist includes the extra (non-overlapping) part for 
 sparse vectors and excludes it for dense vectors. I'll send a PR and we can 
 have more detailed discussion there.






[jira] [Created] (SPARK-5406) LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound

2015-01-25 Thread yuhao yang (JIRA)
yuhao yang created SPARK-5406:
-

 Summary: LocalLAPACK mode in RowMatrix.computeSVD should have much 
smaller upper bound
 Key: SPARK-5406
 URL: https://issues.apache.org/jira/browse/SPARK-5406
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
 Environment: centos, others should be similar
Reporter: yuhao yang
Priority: Minor


In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. 
Yet breeze's svd for dense matrices has a latent constraint. In its 
implementation:

  val workSize = ( 3
    * scala.math.min(m, n)
    * scala.math.min(m, n)
    + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
      * scala.math.min(m, n) + 4 * scala.math.min(m, n))
  )
  val work = new Array[Double](workSize)

As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue, 
thus n < 17515.

This jira is only the first step. If possible, I hope spark can handle matrix 
computation up to 80K * 80K.
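
A quick check of the bound, assuming more rows than columns so that min(m, n) = n 
and the max term reduces to 4n^2 + 4n:

  // workSize = 3n^2 + (4n^2 + 4n) = 7n^2 + 4n must fit in an Int
  def workSize(n: Long): Long = 7 * n * n + 4 * n
  println(workSize(17514)) // 2147251428 < Int.MaxValue (2147483647): fits
  println(workSize(17515)) // 2147496635 > Int.MaxValue: overflows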







[jira] [Updated] (SPARK-5406) LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound

2015-01-26 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-5406:
--
Description: 
In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. 
Yet breeze's svd for dense matrices has a latent constraint. In its 
implementation 
( 
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala
 ):

  val workSize = ( 3
    * scala.math.min(m, n)
    * scala.math.min(m, n)
    + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
      * scala.math.min(m, n) + 4 * scala.math.min(m, n))
  )
  val work = new Array[Double](workSize)

As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue, 
thus n < 17515.

This jira is only the first step. If possible, I hope spark can handle matrix 
computation up to 80K * 80K.


  was:
In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. 
Yet breeze's svd for dense matrices has a latent constraint. In its 
implementation 
(https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala):

  val workSize = ( 3
    * scala.math.min(m, n)
    * scala.math.min(m, n)
    + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
      * scala.math.min(m, n) + 4 * scala.math.min(m, n))
  )
  val work = new Array[Double](workSize)

As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue, 
thus n < 17515.

This jira is only the first step. If possible, I hope spark can handle matrix 
computation up to 80K * 80K.



 LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound
 -

 Key: SPARK-5406
 URL: https://issues.apache.org/jira/browse/SPARK-5406
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
 Environment: centos, others should be similar
Reporter: yuhao yang
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. 
 Yet breeze's svd for dense matrices has a latent constraint. In its 
 implementation 
 ( 
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala
 ):
   val workSize = ( 3
     * scala.math.min(m, n)
     * scala.math.min(m, n)
     + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
       * scala.math.min(m, n) + 4 * scala.math.min(m, n))
   )
   val work = new Array[Double](workSize)
 As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue, 
 thus n < 17515.
 This jira is only the first step. If possible, I hope spark can handle matrix 
 computation up to 80K * 80K.






[jira] [Updated] (SPARK-5406) LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound

2015-01-26 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-5406:
--
Description: 
In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. 
Yet breeze's svd for dense matrices has a latent constraint. In its 
implementation 
(https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala):

  val workSize = ( 3
    * scala.math.min(m, n)
    * scala.math.min(m, n)
    + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
      * scala.math.min(m, n) + 4 * scala.math.min(m, n))
  )
  val work = new Array[Double](workSize)

As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue, 
thus n < 17515.

This jira is only the first step. If possible, I hope spark can handle matrix 
computation up to 80K * 80K.


  was:
In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. 
Yet breeze's svd for dense matrices has a latent constraint. In its 
implementation:

  val workSize = ( 3
    * scala.math.min(m, n)
    * scala.math.min(m, n)
    + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
      * scala.math.min(m, n) + 4 * scala.math.min(m, n))
  )
  val work = new Array[Double](workSize)

As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue, 
thus n < 17515.

This jira is only the first step. If possible, I hope spark can handle matrix 
computation up to 80K * 80K.



 LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound
 -

 Key: SPARK-5406
 URL: https://issues.apache.org/jira/browse/SPARK-5406
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
 Environment: centos, others should be similar
Reporter: yuhao yang
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. 
 Yet breeze's svd for dense matrices has a latent constraint. In its 
 implementation 
 (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala):
   val workSize = ( 3
     * scala.math.min(m, n)
     * scala.math.min(m, n)
     + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
       * scala.math.min(m, n) + 4 * scala.math.min(m, n))
   )
   val work = new Array[Double](workSize)
 As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue, 
 thus n < 17515.
 This jira is only the first step. If possible, I hope spark can handle matrix 
 computation up to 80K * 80K.






[jira] [Closed] (SPARK-5406) LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound

2015-02-01 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang closed SPARK-5406.
-

Fixed and merged. Thanks

 LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound
 -

 Key: SPARK-5406
 URL: https://issues.apache.org/jira/browse/SPARK-5406
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
 Environment: centos, others should be similar
Reporter: yuhao yang
Assignee: yuhao yang
Priority: Minor
 Fix For: 1.3.0

   Original Estimate: 2h
  Remaining Estimate: 2h

 In RowMatrix.computeSVD, under LocalLAPACK mode, the code invokes brzSvd. 
 Yet breeze's svd for dense matrices has a latent constraint. In its 
 implementation 
 ( 
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala
 ):
   val workSize = ( 3
     * scala.math.min(m, n)
     * scala.math.min(m, n)
     + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
       * scala.math.min(m, n) + 4 * scala.math.min(m, n))
   )
   val work = new Array[Double](workSize)
 As a result, the column count must satisfy 7 * n * n + 4 * n < Int.MaxValue, 
 thus n < 17515.
 This jira is only the first step. If possible, I hope spark can handle matrix 
 computation up to 80K * 80K.






[jira] [Commented] (SPARK-5510) How can I fix the spark-submit script and then running the program on cluster ?

2015-02-01 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14300939#comment-14300939
 ] 

yuhao yang commented on SPARK-5510:
---

https://spark.apache.org/community.html
check the mailing list section.

 How can I fix the spark-submit script and then running the program on cluster 
 ?
 ---

 Key: SPARK-5510
 URL: https://issues.apache.org/jira/browse/SPARK-5510
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.0.2
Reporter: hash-x
  Labels: Help!!, spark-submit

 Reference: My question is how can I fix the script so that I can submit the 
 program to a Master from my laptop, rather than from a cluster? Submitting 
 the program from Node 2 works for me, but the laptop does not. How can I fix 
 this? Help!
 I have looked at the email below and accept recommendation 1 (run spark-shell 
 from a cluster node), but I would like to solve the problem with 
 recommendation 2, and I am confused.
 Hi Ken,
 This is unfortunately a limitation of spark-shell and the way it works on the 
 standalone mode.
 spark-shell sets an environment variable, SPARK_HOME, which tells Spark where 
 to find its
 code installed on the cluster. This means that the path on your laptop must 
 be the same as
 on the cluster, which is not the case. I recommend one of two things:
 1) Either run spark-shell from a cluster node, where it will have the right 
 path. (In general
 it’s also better for performance to have it close to the cluster)
 2) Or, edit the spark-shell script and re-export SPARK_HOME right before it 
 runs the Java
 command (ugly but will probably work).






[jira] [Created] (SPARK-5234) examples for ml don't have sparkContext.stop

2015-01-13 Thread yuhao yang (JIRA)
yuhao yang created SPARK-5234:
-

 Summary: examples for ml don't have sparkContext.stop
 Key: SPARK-5234
 URL: https://issues.apache.org/jira/browse/SPARK-5234
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.2.0
 Environment: all
Reporter: yuhao yang
Priority: Trivial
 Fix For: 1.3.0


Not sure why sc.stop() is not in the 
org.apache.spark.examples.ml {CrossValidatorExample, SimpleParamsExample, 
SimpleTextClassificationPipeline}. 

I can prepare a PR if it's not intentional to omit the call to stop.






[jira] [Created] (SPARK-5243) Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster

2015-01-13 Thread yuhao yang (JIRA)
yuhao yang created SPARK-5243:
-

 Summary: Spark will hang if (driver memory + executor memory) 
exceeds limit on a 1-worker cluster
 Key: SPARK-5243
 URL: https://issues.apache.org/jira/browse/SPARK-5243
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.2.0
 Environment: centos, others should be similar
Reporter: yuhao yang
Priority: Minor


Spark will hang if calling spark-submit under the conditions:

1. the cluster has only one worker.
2. driver memory + executor memory > worker memory
3. deploy-mode = cluster

This usually happens to beginners during development.
There should be some exit mechanism, or at least a warning message, in the 
output of spark-submit.

I am preparing a PR for the case, and I would like to know your opinions on 
whether a fix is needed and on better fix options.








[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2015-01-09 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270869#comment-14270869
 ] 

yuhao yang commented on SPARK-1405:
---

Great design doc and solid proposal. 

I noticed the online variational EM mentioned in the doc, for which I have 
developed a Spark implementation. The work was based on an actual customer 
scenario and has exhibited remarkable speed and economical memory usage. The 
results are as good as the “batch” LDA, with handy support for streaming text 
thanks to its online nature. 

Right now we are turning it into a graph-based implementation and will perform 
further evaluation afterwards. The algorithm looks promising to us and can be 
helpful in many cases. For now I don’t find that online LDA makes the API 
design more complicated, as it’s more of an incremental work. I just want to 
bring up the possibility in case anyone finds a conflict.

Reference: [online 
LDA|https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf] by 
[Matt Hoffman|http://www.cs.princeton.edu/~mdhoffma/] and [David M. 
Blei|http://www.cs.princeton.edu/~blei/topicmodeling.html] 

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (already solved), a word segmentation (imported from 
 Lucene), and a Gibbs sampling core.






[jira] [Created] (SPARK-5717) add sc.stop to LDA examples

2015-02-10 Thread yuhao yang (JIRA)
yuhao yang created SPARK-5717:
-

 Summary: add sc.stop to LDA examples
 Key: SPARK-5717
 URL: https://issues.apache.org/jira/browse/SPARK-5717
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: yuhao yang
Priority: Trivial


Trivial: add sc.stop() and reorganize imports in LDAExample and JavaLDAExample.






[jira] [Updated] (SPARK-5243) Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster

2015-02-10 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-5243:
--
Description: 
Spark will hang if calling spark-submit under the conditions:

1. the cluster has only one worker.
2. driver memory + executor memory > worker memory
3. deploy-mode = cluster

This usually happens to beginners during development.
There should be some exit mechanism, or at least a warning message, in the 
output of spark-submit.

I would like to know your opinions on whether a fix is needed and on better 
fix options.



  was:
Spark will hang if calling spark-submit under the conditions:

1. the cluster has only one worker.
2. driver memory + executor memory > worker memory
3. deploy-mode = cluster

This usually happens to beginners during development.
There should be some exit mechanism, or at least a warning message, in the 
output of spark-submit.

I am preparing a PR for the case, and I would like to know your opinions on 
whether a fix is needed and on better fix options.




 Spark will hang if (driver memory + executor memory) exceeds limit on a 
 1-worker cluster
 

 Key: SPARK-5243
 URL: https://issues.apache.org/jira/browse/SPARK-5243
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.2.0
 Environment: centos, others should be similar
Reporter: yuhao yang
Priority: Minor

 Spark will hang if calling spark-submit under the conditions:
 1. the cluster has only one worker.
 2. driver memory + executor memory > worker memory
 3. deploy-mode = cluster
 This usually happens to beginners during development.
 There should be some exit mechanism, or at least a warning message, in the 
 output of spark-submit.
 I would like to know your opinions on whether a fix is needed and on better 
 fix options.






[jira] [Updated] (SPARK-5243) Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster

2015-02-10 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-5243:
--
Description: 
Spark will hang if calling spark-submit under the conditions:

1. the cluster has only one worker.
2. driver memory + executor memory > worker memory
3. deploy-mode = cluster

This usually happens to beginners during development.
There should be some exit mechanism, or at least a warning message, in the 
output of spark-submit.

I would like to know your opinions on whether a fix is needed (is this by 
design?) and on better fix options.



  was:
Spark will hang if calling spark-submit under the conditions:

1. the cluster has only one worker.
2. driver memory + executor memory > worker memory
3. deploy-mode = cluster

This usually happens to beginners during development.
There should be some exit mechanism, or at least a warning message, in the 
output of spark-submit.

I would like to know your opinions on whether a fix is needed and on better 
fix options.




 Spark will hang if (driver memory + executor memory) exceeds limit on a 
 1-worker cluster
 

 Key: SPARK-5243
 URL: https://issues.apache.org/jira/browse/SPARK-5243
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.2.0
 Environment: centos, others should be similar
Reporter: yuhao yang
Priority: Minor

 Spark will hang if calling spark-submit under the conditions:
 1. the cluster has only one worker.
 2. driver memory + executor memory > worker memory
 3. deploy-mode = cluster
 This usually happens to beginners during development.
 There should be some exit mechanism, or at least a warning message, in the 
 output of spark-submit.
 I would like to know your opinions on whether a fix is needed (is this by 
 design?) and on better fix options.






[jira] [Commented] (SPARK-5563) LDA with online variational inference

2015-03-16 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364350#comment-14364350
 ] 

yuhao yang commented on SPARK-5563:
---

Matthew Willson, thanks for the attention and the idea. Apart from Gensim, 
vowpal-wabbit also has a distributed implementation provided by Matthew D. 
Hoffman, which seems to be amazingly fast. I'll refer to those libraries as 
much as possible, and suggestions are always welcome.

 LDA with online variational inference
 -

 Key: SPARK-5563
 URL: https://issues.apache.org/jira/browse/SPARK-5563
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: yuhao yang

 Latent Dirichlet Allocation (LDA) parameters can be inferred using online 
 variational inference, as in Hoffman, Blei and Bach. “Online Learning for 
 Latent Dirichlet Allocation.”  NIPS, 2010.  This algorithm should be very 
 efficient and should be able to handle much larger datasets than batch 
 algorithms for LDA.
 This algorithm will also be important for supporting Streaming versions of 
 LDA.
 The implementation will ideally use the same API as the existing LDA but use 
 a different underlying optimizer.
 This will require hooking in to the existing mllib.optimization frameworks.
 This will require some discussion about whether batch versions of online 
 variational inference should be supported, as well as what variational 
 approximation should be used now or in the future.






[jira] [Comment Edited] (SPARK-5563) LDA with online variational inference

2015-03-16 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364350#comment-14364350
 ] 

yuhao yang edited comment on SPARK-5563 at 3/17/15 1:13 AM:


Matthew Willson, thanks for the attention and the idea. Apart from Gensim, 
vowpal-wabbit also has a distributed implementation (C++) provided by Matthew 
D. Hoffman, which seems to be amazingly fast. I'll refer to those libraries as 
much as possible, and suggestions are always welcome.


was (Author: yuhaoyan):
Matthew Willson, thanks for the attention and the idea. Apart from Gensim, 
vowpal-wabbit also has a distributed implementation provided by Matthew D. 
Hoffman, which seems to be amazingly fast. I'll refer to those libraries as 
much as possible, and suggestions are always welcome.

 LDA with online variational inference
 -

 Key: SPARK-5563
 URL: https://issues.apache.org/jira/browse/SPARK-5563
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: yuhao yang

 Latent Dirichlet Allocation (LDA) parameters can be inferred using online 
 variational inference, as in Hoffman, Blei and Bach. “Online Learning for 
 Latent Dirichlet Allocation.”  NIPS, 2010.  This algorithm should be very 
 efficient and should be able to handle much larger datasets than batch 
 algorithms for LDA.
 This algorithm will also be important for supporting Streaming versions of 
 LDA.
 The implementation will ideally use the same API as the existing LDA but use 
 a different underlying optimizer.
 This will require hooking in to the existing mllib.optimization frameworks.
 This will require some discussion about whether batch versions of online 
 variational inference should be supported, as well as what variational 
 approximation should be used now or in the future.






[jira] [Created] (SPARK-6374) Add getter for GeneralizedLinearAlgorithm

2015-03-16 Thread yuhao yang (JIRA)
yuhao yang created SPARK-6374:
-

 Summary: Add getter for GeneralizedLinearAlgorithm
 Key: SPARK-6374
 URL: https://issues.apache.org/jira/browse/SPARK-6374
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: yuhao yang
Priority: Minor


I find it better to have getters for numFeatures and addIntercept within 
GeneralizedLinearAlgorithm during actual usage; otherwise I'll have to get the 
values through a debugger.






[jira] [Updated] (SPARK-6177) LDA should check partitions size of the input

2015-03-09 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-6177:
--
Description: 
Add comment to introduce coalesce to LDA example to avoid the possible massive 
partitions from sc.textFile.

sc.textFile will create RDD with one partition for each file, and the possible 
massive partitions downgrades LDA performance.

  was:sc.textFile will create RDD with one partition for each file, and the 
possible massive partitions downgrades LDA performance.
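
A sketch of the note in question (the path and target partition count are 
illustrative):

  // sc.textFile yields one partition per input file; with many small files
  // this explodes the partition count, so coalesce before running LDA.
  val corpus = sc.textFile("hdfs:///docs/*")
    .coalesce(sc.defaultParallelism)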


 LDA should check partitions size of the input
 -

 Key: SPARK-6177
 URL: https://issues.apache.org/jira/browse/SPARK-6177
 Project: Spark
  Issue Type: Improvement
  Components: Examples, MLlib
Affects Versions: 1.2.1
Reporter: yuhao yang
Priority: Minor
   Original Estimate: 1h
  Remaining Estimate: 1h

 Add a comment introducing coalesce to the LDA example, to avoid the possibly 
 massive number of partitions from sc.textFile.
 sc.textFile creates an RDD with one partition per file, and a massive number 
 of partitions degrades LDA performance.






[jira] [Updated] (SPARK-6177) Add note for

2015-03-09 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-6177:
--
Summary: Add note for   (was: LDA should check partitions size of the input)

 Add note for 
 -

 Key: SPARK-6177
 URL: https://issues.apache.org/jira/browse/SPARK-6177
 Project: Spark
  Issue Type: Improvement
  Components: Examples, MLlib
Affects Versions: 1.2.1
Reporter: yuhao yang
Priority: Minor
   Original Estimate: 1h
  Remaining Estimate: 1h

 Add a comment introducing coalesce to the LDA example, to avoid the possibly 
 massive number of partitions from sc.textFile.
 sc.textFile creates an RDD with one partition per file, and a massive number 
 of partitions degrades LDA performance.






[jira] [Updated] (SPARK-6177) Add note in LDA example to remind possible coalesce

2015-03-09 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-6177:
--
Summary: Add note in LDA example to remind possible coalesce   (was: Add 
note for )

 Add note in LDA example to remind possible coalesce 
 

 Key: SPARK-6177
 URL: https://issues.apache.org/jira/browse/SPARK-6177
 Project: Spark
  Issue Type: Improvement
  Components: Examples, MLlib
Affects Versions: 1.2.1
Reporter: yuhao yang
Priority: Minor
   Original Estimate: 1h
  Remaining Estimate: 1h

 Add a comment introducing coalesce to the LDA example, to avoid the possibly 
 massive number of partitions from sc.textFile.
 sc.textFile creates an RDD with one partition per file, and a massive number 
 of partitions degrades LDA performance.






[jira] [Commented] (SPARK-6268) KMeans parameter getter methods

2015-03-10 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356125#comment-14356125
 ] 

yuhao yang commented on SPARK-6268:
---

Sure, I'll propose a PR very soon. Thanks!

 KMeans parameter getter methods
 ---

 Key: SPARK-6268
 URL: https://issues.apache.org/jira/browse/SPARK-6268
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 KMeans has many setters for parameters.  It should have matching getters.
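
 A minimal sketch of the requested pattern (illustrative, not the merged 
 patch): each parameter's chained setter gains a matching getter.

   class KMeansSketch {
     private var k: Int = 2
     def setK(k: Int): this.type = { this.k = k; this } // existing setter style
     def getK: Int = k                                  // matching getter
   }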






[jira] [Comment Edited] (SPARK-6268) KMeans parameter getter methods

2015-03-10 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356106#comment-14356106
 ] 

yuhao yang edited comment on SPARK-6268 at 3/11/15 2:14 AM:


Hi Bradley, I hope this is not rude. Not sure if you want to do this yourself. 
If not, maybe I can help. Thanks.


was (Author: yuhaoyan):
Hi Bradley, I hope this is not rude. Not sure if you want to do this yourself. 
If not, maybe I can help. 

 KMeans parameter getter methods
 ---

 Key: SPARK-6268
 URL: https://issues.apache.org/jira/browse/SPARK-6268
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 KMeans has many setters for parameters.  It should have matching getters.






[jira] [Closed] (SPARK-6177) Add note in LDA example to remind possible coalesce

2015-03-10 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang closed SPARK-6177.
-

Fixed and merged, thanks.

 Add note in LDA example to remind possible coalesce 
 

 Key: SPARK-6177
 URL: https://issues.apache.org/jira/browse/SPARK-6177
 Project: Spark
  Issue Type: Improvement
  Components: Examples, MLlib
Affects Versions: 1.2.1
Reporter: yuhao yang
Assignee: yuhao yang
Priority: Trivial
 Fix For: 1.4.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 Add a comment introducing coalesce to the LDA example, to avoid the possibly 
 massive number of partitions from sc.textFile.
 sc.textFile creates an RDD with one partition per file, and a massive number 
 of partitions degrades LDA performance.






[jira] [Created] (SPARK-6177) LDA should check partitions size of the input

2015-03-04 Thread yuhao yang (JIRA)
yuhao yang created SPARK-6177:
-

 Summary: LDA should check partitions size of the input
 Key: SPARK-6177
 URL: https://issues.apache.org/jira/browse/SPARK-6177
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: yuhao yang









[jira] [Updated] (SPARK-6177) LDA should check partitions size of the input

2015-03-04 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-6177:
--
Description: sc.textFile creates an RDD with one partition per file, and a 
massive number of partitions degrades LDA performance.

 LDA should check partitions size of the input
 -

 Key: SPARK-6177
 URL: https://issues.apache.org/jira/browse/SPARK-6177
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: yuhao yang
   Original Estimate: 1h
  Remaining Estimate: 1h

 sc.textFile creates an RDD with one partition per file, and a massive number 
 of partitions degrades LDA performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5384) Vectors.sqdist return inconsistent result for sparse/dense vectors when the vectors have different lengths

2015-01-23 Thread yuhao yang (JIRA)
yuhao yang created SPARK-5384:
-

 Summary: Vectors.sqdist return inconsistent result for 
sparse/dense vectors when the vectors have different lengths
 Key: SPARK-5384
 URL: https://issues.apache.org/jira/browse/SPARK-5384
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.1
 Environment: centos, others should be similar
Reporter: yuhao yang
Priority: Critical
 Fix For: 1.2.1


For two vectors of different lengths, Vectors.sqdist returns different results 
depending on whether the vectors are represented as sparse or dense. 
Sample:
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors}

val s1 = new SparseVector(4, Array(0, 1, 2, 3), Array(1.0, 2.0, 3.0, 4.0))
val s2 = new SparseVector(1, Array(0), Array(9.0))
val d1 = new DenseVector(Array(1.0, 2.0, 3.0, 4.0))
val d2 = new DenseVector(Array(9.0))
println(s1 == d1 && s2 == d2)
println(Vectors.sqdist(s1, s2))
println(Vectors.sqdist(d1, d2))
result:
 true
 93.0
 64.0

More precisely, for the extra dimensions (those present only in the longer 
vector), Vectors.sqdist includes them for sparse vectors and excludes them for 
dense vectors. I'll send a PR and we can have a more detailed discussion there.
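For reference, a minimal sketch of the consistent semantics, assuming the shorter vector is treated as zero-padded; this reproduces the sparse result (93.0) and is not the merged fix:
{code}
// Zero-pad the shorter array, then sum squared differences over all dimensions.
def sqdistPadded(a: Array[Double], b: Array[Double]): Double = {
  val n = math.max(a.length, b.length)
  var sum = 0.0
  var i = 0
  while (i < n) {
    val d = (if (i < a.length) a(i) else 0.0) - (if (i < b.length) b(i) else 0.0)
    sum += d * d
    i += 1
  }
  sum
}

sqdistPadded(Array(1.0, 2.0, 3.0, 4.0), Array(9.0))  // 93.0
{code}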




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6693) add to string with max lines and width for matrix

2015-04-03 Thread yuhao yang (JIRA)
yuhao yang created SPARK-6693:
-

 Summary: add to string with max lines and width for matrix
 Key: SPARK-6693
 URL: https://issues.apache.org/jira/browse/SPARK-6693
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: yuhao yang
Priority: Minor


It's kind of annoying when debugging to find that you cannot print out the 
matrix as you want.

The original toString of Matrix only prints like the following:
0.17810102596909183  0.5616906241468385    ... (100 total)
0.9692861997823815   0.015558159784155756  ...
0.8513015122819192   0.031523763918528847  ...
0.5396875653953941   0.3267864552779176    ...

A def toString(maxLines: Int, maxWidth: Int) would be useful when debugging, 
logging, and saving matrices to files.
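A rough standalone sketch of such a bounded formatter, over a plain rows-of-doubles view; illustrative only, not the merged implementation:
{code}
// Show at most maxLines rows and truncate each rendered row to maxWidth chars.
def toStringBounded(rows: Seq[Seq[Double]], maxLines: Int, maxWidth: Int): String = {
  val shown = rows.take(maxLines).map(_.mkString(" ").take(maxWidth))
  val more = if (rows.length > maxLines) s"\n... (${rows.length} rows total)" else ""
  shown.mkString("\n") + more
}
{code}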



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6693) add toString with max lines and width for matrix

2015-04-03 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-6693:
--
Summary: add toString with max lines and width for matrix  (was: add to 
string with max lines and width for matrix)

 add toString with max lines and width for matrix
 

 Key: SPARK-6693
 URL: https://issues.apache.org/jira/browse/SPARK-6693
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: yuhao yang
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 It's kind of annoying when debugging to find that you cannot print out the 
 matrix as you want.
 The original toString of Matrix only prints like the following:
 0.17810102596909183  0.5616906241468385    ... (100 total)
 0.9692861997823815   0.015558159784155756  ...
 0.8513015122819192   0.031523763918528847  ...
 0.5396875653953941   0.3267864552779176    ...
 A def toString(maxLines: Int, maxWidth: Int) would be useful when debugging, 
 logging, and saving matrices to files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6374) Add getter for GeneralizedLinearAlgorithm

2015-04-14 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang closed SPARK-6374.
-

Fix merged. Thanks.

 Add getter for GeneralizedLinearAlgorithm
 -

 Key: SPARK-6374
 URL: https://issues.apache.org/jira/browse/SPARK-6374
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: yuhao yang
Assignee: yuhao yang
Priority: Minor
 Fix For: 1.4.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 In actual usage it's better to have getters for numFeatures and addIntercept 
 within GeneralizedLinearAlgorithm; otherwise I'll have to get the values 
 through the debugger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6693) add toString with max lines and width for matrix

2015-04-14 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang closed SPARK-6693.
-

Fix merged. Thanks.

 add toString with max lines and width for matrix
 

 Key: SPARK-6693
 URL: https://issues.apache.org/jira/browse/SPARK-6693
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: yuhao yang
Assignee: yuhao yang
Priority: Minor
 Fix For: 1.4.0

   Original Estimate: 2h
  Remaining Estimate: 2h

 It's kind of annoying when debugging to find that you cannot print out the 
 matrix as you want.
 The original toString of Matrix only prints like the following:
 0.17810102596909183  0.5616906241468385    ... (100 total)
 0.9692861997823815   0.015558159784155756  ...
 0.8513015122819192   0.031523763918528847  ...
 0.5396875653953941   0.3267864552779176    ...
 A def toString(maxLines: Int, maxWidth: Int) would be useful when debugging, 
 logging, and saving matrices to files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7090) Introduce LDAOptimizer to LDA to further improve extensibility

2015-04-23 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508907#comment-14508907
 ] 

yuhao yang commented on SPARK-7090:
---

Oops, I thought there was something wrong... I'll close the other. Thanks

 Introduce LDAOptimizer to LDA to further improve extensibility
 --

 Key: SPARK-7090
 URL: https://issues.apache.org/jira/browse/SPARK-7090
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.1
Reporter: yuhao yang
   Original Estimate: 72h
  Remaining Estimate: 72h

 LDA was implemented with extensibility in mind, and with the development of 
 OnlineLDA and Gibbs sampling we are collecting more detailed requirements 
 from different algorithms.
 As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807, and 
 after some further discussion, we'd like to adjust the code structure a 
 little to present the common interface and extension point clearly.
 Basically, class LDA would be the common entry point for LDA computation, 
 and each LDA object would refer to an LDAOptimizer for the concrete 
 algorithm implementation. Users can customize the LDAOptimizer with specific 
 parameters and assign it to the LDA.
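A usage sketch of the proposed extension point, assuming the setters land as described; corpus: RDD[(Long, Vector)] is assumed to be in scope:
{code}
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

// Customize an optimizer, then hand it to the common LDA entry point.
val optimizer = new OnlineLDAOptimizer().setMiniBatchFraction(0.05)
val model = new LDA().setK(10).setOptimizer(optimizer).run(corpus)
{code}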



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7089) Introduce LDAOptimizer to LDA to improve extensibility

2015-04-23 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang closed SPARK-7089.
-
Resolution: Duplicate

Sorry for the duplication

 Introduce LDAOptimizer to LDA to improve extensibility
 --

 Key: SPARK-7089
 URL: https://issues.apache.org/jira/browse/SPARK-7089
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.1
Reporter: yuhao yang
Priority: Minor
   Original Estimate: 72h
  Remaining Estimate: 72h

 LDA was implemented with extensibility in mind, and with the development of 
 OnlineLDA and Gibbs sampling we are collecting more detailed requirements 
 from different algorithms.
 As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807, and 
 after some further discussion, we'd like to adjust the code structure a 
 little to present the common interface and extension point clearly.
 Basically, class LDA would be the common entry point for LDA computation, 
 and each LDA object would refer to an LDAOptimizer for the concrete 
 algorithm implementation. Users can customize the LDAOptimizer with specific 
 parameters and assign it to the LDA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7089) Introduce LDAOptimizer to LDA to improve extensibility

2015-04-23 Thread yuhao yang (JIRA)
yuhao yang created SPARK-7089:
-

 Summary: Introduce LDAOptimizer to LDA to improve extensibility
 Key: SPARK-7089
 URL: https://issues.apache.org/jira/browse/SPARK-7089
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.1
Reporter: yuhao yang


LDA was implemented with extensibility in mind, and with the development of 
OnlineLDA and Gibbs sampling we are collecting more detailed requirements from 
different algorithms.

As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807, and 
after some further discussion, we'd like to adjust the code structure a little 
to present the common interface and extension point clearly.

Basically, class LDA would be the common entry point for LDA computation, and 
each LDA object would refer to an LDAOptimizer for the concrete algorithm 
implementation. Users can customize the LDAOptimizer with specific parameters 
and assign it to the LDA.








--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7090) Introduce LDAOptimizer to LDA to improve extensibility

2015-04-23 Thread yuhao yang (JIRA)
yuhao yang created SPARK-7090:
-

 Summary: Introduce LDAOptimizer to LDA to improve extensibility
 Key: SPARK-7090
 URL: https://issues.apache.org/jira/browse/SPARK-7090
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.1
Reporter: yuhao yang


LDA was implemented with extensibility in mind, and with the development of 
OnlineLDA and Gibbs sampling we are collecting more detailed requirements from 
different algorithms.

As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807, and 
after some further discussion, we'd like to adjust the code structure a little 
to present the common interface and extension point clearly.

Basically, class LDA would be the common entry point for LDA computation, and 
each LDA object would refer to an LDAOptimizer for the concrete algorithm 
implementation. Users can customize the LDAOptimizer with specific parameters 
and assign it to the LDA.








--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7090) Introduce LDAOptimizer to LDA to further improve extensibility

2015-04-23 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-7090:
--
Summary: Introduce LDAOptimizer to LDA to further improve extensibility  
(was: Introduce LDAOptimizer to LDA to improve extensibility )

 Introduce LDAOptimizer to LDA to further improve extensibility
 --

 Key: SPARK-7090
 URL: https://issues.apache.org/jira/browse/SPARK-7090
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.1
Reporter: yuhao yang
   Original Estimate: 72h
  Remaining Estimate: 72h

 LDA was implemented with extensibility in mind, and with the development of 
 OnlineLDA and Gibbs sampling we are collecting more detailed requirements 
 from different algorithms.
 As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807, and 
 after some further discussion, we'd like to adjust the code structure a 
 little to present the common interface and extension point clearly.
 Basically, class LDA would be the common entry point for LDA computation, 
 and each LDA object would refer to an LDAOptimizer for the concrete 
 algorithm implementation. Users can customize the LDAOptimizer with specific 
 parameters and assign it to the LDA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7090) Introduce LDAOptimizer to LDA to improve extensibility

2015-04-23 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-7090:
--
Summary: Introduce LDAOptimizer to LDA to improve extensibility   (was: 
Introduce LDAOptimizer to LDA to improve extensibility)

 Introduce LDAOptimizer to LDA to improve extensibility 
 ---

 Key: SPARK-7090
 URL: https://issues.apache.org/jira/browse/SPARK-7090
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.1
Reporter: yuhao yang
   Original Estimate: 72h
  Remaining Estimate: 72h

 LDA was implemented with extensibility in mind, and with the development of 
 OnlineLDA and Gibbs sampling we are collecting more detailed requirements 
 from different algorithms.
 As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807, and 
 after some further discussion, we'd like to adjust the code structure a 
 little to present the common interface and extension point clearly.
 Basically, class LDA would be the common entry point for LDA computation, 
 and each LDA object would refer to an LDAOptimizer for the concrete 
 algorithm implementation. Users can customize the LDAOptimizer with specific 
 parameters and assign it to the LDA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-7090) Introduce LDAOptimizer to LDA to further improve extensibility

2015-04-23 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang reopened SPARK-7090:
---

Reopening this since 7089 was already closed.

 Introduce LDAOptimizer to LDA to further improve extensibility
 --

 Key: SPARK-7090
 URL: https://issues.apache.org/jira/browse/SPARK-7090
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.1
Reporter: yuhao yang
   Original Estimate: 72h
  Remaining Estimate: 72h

 LDA was implemented with extensibility in mind, and with the development of 
 OnlineLDA and Gibbs sampling we are collecting more detailed requirements 
 from different algorithms.
 As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807, and 
 after some further discussion, we'd like to adjust the code structure a 
 little to present the common interface and extension point clearly.
 Basically, class LDA would be the common entry point for LDA computation, 
 and each LDA object would refer to an LDAOptimizer for the concrete 
 algorithm implementation. Users can customize the LDAOptimizer with specific 
 parameters and assign it to the LDA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7090) Introduce LDAOptimizer to LDA to further improve extensibility

2015-04-23 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508907#comment-14508907
 ] 

yuhao yang edited comment on SPARK-7090 at 4/23/15 12:00 PM:
-

Oops, I thought there was something wrong... I'll close the other. Thanks


was (Author: yuhaoyan):
Hoops, I thought there was something wrong... I'll close the other. Thanks

 Introduce LDAOptimizer to LDA to further improve extensibility
 --

 Key: SPARK-7090
 URL: https://issues.apache.org/jira/browse/SPARK-7090
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.1
Reporter: yuhao yang
   Original Estimate: 72h
  Remaining Estimate: 72h

 LDA was implemented with extensibility in mind, and with the development of 
 OnlineLDA and Gibbs sampling we are collecting more detailed requirements 
 from different algorithms.
 As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807, and 
 after some further discussion, we'd like to adjust the code structure a 
 little to present the common interface and extension point clearly.
 Basically, class LDA would be the common entry point for LDA computation, 
 and each LDA object would refer to an LDAOptimizer for the concrete 
 algorithm implementation. Users can customize the LDAOptimizer with specific 
 parameters and assign it to the LDA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7368) add QR decomposition for RowMatrix

2015-05-05 Thread yuhao yang (JIRA)
yuhao yang created SPARK-7368:
-

 Summary: add QR decomposition for RowMatrix
 Key: SPARK-7368
 URL: https://issues.apache.org/jira/browse/SPARK-7368
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: yuhao yang


Add QR decomposition for RowMatrix.

There's a great distributed algorithm for QR decomposition that I'm currently 
following:

Austin R. Benson, David F. Gleich, James Demmel. Direct QR factorizations for 
tall-and-skinny matrices in MapReduce architectures, 2013 IEEE International 
Conference on Big Data
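For context, a usage sketch of what such an API could look like; the method name tallSkinnyQR reflects how the feature eventually landed in RowMatrix and should be read as an assumption here (rows: RDD[Vector] assumed in scope):
{code}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)
// Reduced QR factorization: R is small (n x n) and can be handled locally.
val result = mat.tallSkinnyQR(computeQ = true)
println(result.R)
{code}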





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7368) add QR decomposition for RowMatrix

2015-05-05 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529742#comment-14529742
 ] 

yuhao yang commented on SPARK-7368:
---

Oops, I was not aware of the previous effort. Thanks, Joseph and Zongheng.

I'll try the AMPLab version and send an update.

 add QR decomposition for RowMatrix
 --

 Key: SPARK-7368
 URL: https://issues.apache.org/jira/browse/SPARK-7368
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: yuhao yang
   Original Estimate: 48h
  Remaining Estimate: 48h

 Add QR decomposition for RowMatrix.
 There's a great distributed algorithm for QR decomposition that I'm 
 currently following:
 Austin R. Benson, David F. Gleich, James Demmel. Direct QR factorizations 
 for tall-and-skinny matrices in MapReduce architectures, 2013 IEEE 
 International Conference on Big Data



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7475) adjust ldaExample for online LDA

2015-05-07 Thread yuhao yang (JIRA)
yuhao yang created SPARK-7475:
-

 Summary: adjust ldaExample for online LDA
 Key: SPARK-7475
 URL: https://issues.apache.org/jira/browse/SPARK-7475
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: yuhao yang
Priority: Minor


Add a new argument specifying the algorithm applied to LDA, to demonstrate the 
basic usage of LDAOptimizer.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7514) Add MinMaxScaler to feature transformation

2015-05-10 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537602#comment-14537602
 ] 

yuhao yang commented on SPARK-7514:
---

Class name has always been MinMaxScaler in the code, yet I named the JIRA 
wrongly...

For the parameters, the model currently looks like:

class MinMaxScalerModel(
    val min: Vector,
    val max: Vector,
    var newBase: Double,
    var scale: Double) extends VectorTransformer

I have used min and max to store the model statistics. In some articles the 
range bounds are named newMin / newMax (which I think can be confusing); I 
ran out of variable names here...

setCenterScale looks good.






 Add MinMaxScaler to feature transformation
 --

 Key: SPARK-7514
 URL: https://issues.apache.org/jira/browse/SPARK-7514
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: yuhao yang
   Original Estimate: 24h
  Remaining Estimate: 24h

 Add a popular scaling method to the feature component, commonly known as 
 min-max normalization or Rescaling.
 Core function is,
 Normalized( x ) = (x - min) / (max - min) * scale + newBase
 where newBase and scale are parameters of the VectorTransformer. newBase is 
 the new minimum value for the feature, and scale controls the range after 
 transformation. This is a little more complicated than basic MinMax 
 normalization, yet it provides flexibility so that users can control the 
 range more specifically, e.g. [0.1, 0.9] in some NN applications.
 For the case that max == min, 0.5 is used as the raw value.
 reference:
  http://en.wikipedia.org/wiki/Feature_scaling
 http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm
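A minimal sketch of the per-feature rescaling rule above, including the max == min fallback; an illustrative helper, not the model code:
{code}
// Normalized( x ) = (x - min) / (max - min) * scale + newBase, with 0.5 as the
// raw value when the feature is constant (max == min).
def rescale(x: Double, min: Double, max: Double, scale: Double, newBase: Double): Double = {
  val raw = if (max == min) 0.5 else (x - min) / (max - min)
  raw * scale + newBase
}
{code}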



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7514) Add MinMaxScaler to feature transformation

2015-05-11 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537651#comment-14537651
 ] 

yuhao yang commented on SPARK-7514:
---

Thanks Joseph, just one concern about using center: it will change the core 
function from
Normalized( x ) = (x - min) / (max - min) * scale + newBase
to
Normalized( x ) = ((x - min) / (max - min) - 0.5) * scale + center
which seems not as straightforward.

Sure, we can further discuss it over code.
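For what it's worth, the two parameterizations coincide when newBase = center - scale / 2; a tiny illustrative sketch of that identity:
{code}
// raw is the normalized value (x - min) / (max - min).
def viaNewBase(raw: Double, scale: Double, newBase: Double) = raw * scale + newBase
def viaCenter(raw: Double, scale: Double, center: Double) = (raw - 0.5) * scale + center
// viaNewBase(raw, s, c - s / 2) == viaCenter(raw, s, c) for all raw.
{code}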

 Add MinMaxScaler to feature transformation
 --

 Key: SPARK-7514
 URL: https://issues.apache.org/jira/browse/SPARK-7514
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: yuhao yang
   Original Estimate: 24h
  Remaining Estimate: 24h

 Add a popular scaling method to the feature component, commonly known as 
 min-max normalization or Rescaling.
 Core function is,
 Normalized( x ) = (x - min) / (max - min) * scale + newBase
 where newBase and scale are parameters of the VectorTransformer. newBase is 
 the new minimum value for the feature, and scale controls the range after 
 transformation. This is a little more complicated than basic MinMax 
 normalization, yet it provides flexibility so that users can control the 
 range more specifically, e.g. [0.1, 0.9] in some NN applications.
 For the case that max == min, 0.5 is used as the raw value.
 reference:
  http://en.wikipedia.org/wiki/Feature_scaling
 http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7514) Add MinMaxScaler to feature transformation

2015-05-11 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537651#comment-14537651
 ] 

yuhao yang edited comment on SPARK-7514 at 5/11/15 6:41 AM:


Thanks Joseph, just one concern about using center: it will change the core 
function from
Normalized( x ) = (x - min) / (max - min) * scale + newBase
to
Normalized( x ) = ((x - min) / (max - min) - 0.5) * scale + center
which seems not as straightforward.

Sure, we can further discuss it over code.


was (Author: yuhaoyan):
Thanks Joseph, just one concern for using center as it will change the core 
function from
Normalized( x ) = (x - min) / (max - min) * scale + newBase
to 
Normalized( x ) = ((x - min) / (max - min)  - 0.5 )* scale + center
which seems be to not as straightforward.

Sure we can further discuss it over code.

 Add MinMaxScaler to feature transformation
 --

 Key: SPARK-7514
 URL: https://issues.apache.org/jira/browse/SPARK-7514
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: yuhao yang
   Original Estimate: 24h
  Remaining Estimate: 24h

 Add a popular scaling method to the feature component, commonly known as 
 min-max normalization or Rescaling.
 Core function is,
 Normalized( x ) = (x - min) / (max - min) * scale + newBase
 where newBase and scale are parameters of the VectorTransformer. newBase is 
 the new minimum value for the feature, and scale controls the range after 
 transformation. This is a little more complicated than basic MinMax 
 normalization, yet it provides flexibility so that users can control the 
 range more specifically, e.g. [0.1, 0.9] in some NN applications.
 For the case that max == min, 0.5 is used as the raw value.
 reference:
  http://en.wikipedia.org/wiki/Feature_scaling
 http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7514) Add MinMaxScaler to feature transformation

2015-05-10 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-7514:
--
Summary: Add MinMaxScaler to feature transformation  (was: Add 
MinMaxNormalizer to feature transformation)

 Add MinMaxScaler to feature transformation
 --

 Key: SPARK-7514
 URL: https://issues.apache.org/jira/browse/SPARK-7514
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: yuhao yang
   Original Estimate: 24h
  Remaining Estimate: 24h

 Add a popular scaling method to the feature component, commonly known as 
 min-max normalization or Rescaling.
 Core function is,
 Normalized( x ) = (x - min) / (max - min) * scale + newBase
 where newBase and scale are parameters of the VectorTransformer. newBase is 
 the new minimum value for the feature, and scale controls the range after 
 transformation. This is a little more complicated than basic MinMax 
 normalization, yet it provides flexibility so that users can control the 
 range more specifically, e.g. [0.1, 0.9] in some NN applications.
 For the case that max == min, 0.5 is used as the raw value.
 reference:
  http://en.wikipedia.org/wiki/Feature_scaling
 http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7514) Add MinMaxNormalizer to feature transformation

2015-05-10 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537527#comment-14537527
 ] 

yuhao yang commented on SPARK-7514:
---

Hi Joseph, that's a good idea. I did a quick Google search:

weka: class Normalize takes a scaling factor and a translation (the same 
concepts as scale and newBase).
http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/Normalize.html

sklearn.preprocessing.MinMaxScaler takes min and scale, yet in array format:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

Some implement basic MinMax and take no extra parameters:
http://docs.pervasive.com/products/DataRush/DF63/javadoc/com/pervasive/datarush/analytics/functions/StatsFunctions.html
http://help.sap.com/saphelp_hanaplatform/helpdata/en/e3/f29fafd4ac43339a1a39407884e545/content.htm?frameset=/en/e6/5c78507a424be58e52877496e2b516/frameset.htmcurrent_toc=/en/32/731a7719f14e488b1f4ab0afae995b/plain.htmnode_id=52


 Add MinMaxNormalizer to feature transformation
 --

 Key: SPARK-7514
 URL: https://issues.apache.org/jira/browse/SPARK-7514
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: yuhao yang
   Original Estimate: 24h
  Remaining Estimate: 24h

 Add a popular scaling method to the feature component, commonly known as 
 min-max normalization or Rescaling.
 Core function is,
 Normalized( x ) = (x - min) / (max - min) * scale + newBase
 where newBase and scale are parameters of the VectorTransformer. newBase is 
 the new minimum value for the feature, and scale controls the range after 
 transformation. This is a little more complicated than basic MinMax 
 normalization, yet it provides flexibility so that users can control the 
 range more specifically, e.g. [0.1, 0.9] in some NN applications.
 For the case that max == min, 0.5 is used as the raw value.
 reference:
  http://en.wikipedia.org/wiki/Feature_scaling
 http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7496) Update Programming guide with Online LDA

2015-05-10 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537482#comment-14537482
 ] 

yuhao yang commented on SPARK-7496:
---

Thanks Joseph. PR sent.

 Update Programming guide with Online LDA
 

 Key: SPARK-7496
 URL: https://issues.apache.org/jira/browse/SPARK-7496
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Reporter: Joseph K. Bradley
Priority: Minor

 Update LDA subsection of clustering section of MLlib programming guide to 
 include OnlineLDA



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7514) Add MinMaxNormalizer to feature transformation

2015-05-10 Thread yuhao yang (JIRA)
yuhao yang created SPARK-7514:
-

 Summary: Add MinMaxNormalizer to feature transformation
 Key: SPARK-7514
 URL: https://issues.apache.org/jira/browse/SPARK-7514
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: yuhao yang


Add a new scaling method to the feature component, commonly known as min-max 
normalization or Rescaling.

Core function is,
Normalized(x) = (x - min) / (max - min) * scale + newBase

where newBase is the new minimum value for the feature, and scale controls the 
range after transformation. This is a little more complicated than basic 
MinMax normalization, yet it provides flexibility so that users can control 
the range more specifically, e.g. [0.1, 0.9] in some NN applications.

For the case that max == min, 0.5 is used as the raw value.

reference:
 http://en.wikipedia.org/wiki/Feature_scaling
http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7514) Add MinMaxNormalizer to feature transformation

2015-05-10 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-7514:
--
Description: 
Add a new scaling method to the feature component, commonly known as min-max 
normalization or Rescaling.

Core function is,
Normalized(x) = (x - min) / (max - min) * scale + newBase

where newBase and scale are parameters of the VectorTransformer. newBase is 
the new minimum value for the feature, and scale controls the range after 
transformation. This is a little more complicated than basic MinMax 
normalization, yet it provides flexibility so that users can control the 
range more specifically, e.g. [0.1, 0.9] in some NN applications.

For the case that max == min, 0.5 is used as the raw value.

reference:
 http://en.wikipedia.org/wiki/Feature_scaling
http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm


  was:
Add a new scaling method to feature component, which is commonly known as 
min-max normalization or Rescaling.

Core function is,
Normalized(x) = (x - min) / (max - min) * scale + newBase

where newBase the new minimum number for the feature, and scale controls the 
range after transformation. This is a little complicated than the basic MinMax 
normalization, yet it provides flexibility so that users can control the range 
more specifically. like [0.1, 0.9] in some NN application.

for case that max == min, 0.5 is used as the raw value.

reference:
 http://en.wikipedia.org/wiki/Feature_scaling
http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm



 Add MinMaxNormalizer to feature transformation
 --

 Key: SPARK-7514
 URL: https://issues.apache.org/jira/browse/SPARK-7514
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: yuhao yang
   Original Estimate: 24h
  Remaining Estimate: 24h

 Add a new scaling method to the feature component, commonly known as min-max 
 normalization or Rescaling.
 Core function is,
 Normalized(x) = (x - min) / (max - min) * scale + newBase
 where newBase and scale are parameters of the VectorTransformer. newBase is 
 the new minimum value for the feature, and scale controls the range after 
 transformation. This is a little more complicated than basic MinMax 
 normalization, yet it provides flexibility so that users can control the 
 range more specifically, e.g. [0.1, 0.9] in some NN applications.
 For the case that max == min, 0.5 is used as the raw value.
 reference:
  http://en.wikipedia.org/wiki/Feature_scaling
 http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7514) Add MinMaxNormalizer to feature transformation

2015-05-10 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-7514:
--
Description: 
Add a popular scaling method to the feature component, commonly known as 
min-max normalization or Rescaling.

Core function is,
Normalized( x ) = (x - min) / (max - min) * scale + newBase

where newBase and scale are parameters of the VectorTransformer. newBase is 
the new minimum value for the feature, and scale controls the range after 
transformation. This is a little more complicated than basic MinMax 
normalization, yet it provides flexibility so that users can control the 
range more specifically, e.g. [0.1, 0.9] in some NN applications.

For the case that max == min, 0.5 is used as the raw value.

reference:
 http://en.wikipedia.org/wiki/Feature_scaling
http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm


  was:
Add a new scaling method to feature component, which is commonly known as 
min-max normalization or Rescaling.

Core function is,
Normalized( x ) = (x - min) / (max - min) * scale + newBase

where newBase and scale are parameters of the VectorTransformer. newBase is the 
new minimum number for the feature, and scale controls the range after 
transformation. This is a little complicated than the basic MinMax 
normalization, yet it provides flexibility so that users can control the range 
more specifically. like [0.1, 0.9] in some NN application.

for case that max == min, 0.5 is used as the raw value.

reference:
 http://en.wikipedia.org/wiki/Feature_scaling
http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm



 Add MinMaxNormalizer to feature transformation
 --

 Key: SPARK-7514
 URL: https://issues.apache.org/jira/browse/SPARK-7514
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: yuhao yang
   Original Estimate: 24h
  Remaining Estimate: 24h

 Add a popular scaling method to the feature component, commonly known as 
 min-max normalization or Rescaling.
 Core function is,
 Normalized( x ) = (x - min) / (max - min) * scale + newBase
 where newBase and scale are parameters of the VectorTransformer. newBase is 
 the new minimum value for the feature, and scale controls the range after 
 transformation. This is a little more complicated than basic MinMax 
 normalization, yet it provides flexibility so that users can control the 
 range more specifically, e.g. [0.1, 0.9] in some NN applications.
 For the case that max == min, 0.5 is used as the raw value.
 reference:
  http://en.wikipedia.org/wiki/Feature_scaling
 http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7514) Add MinMaxNormalizer to feature transformation

2015-05-10 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-7514:
--
Description: 
Add a new scaling method to the feature component, commonly known as min-max 
normalization or Rescaling.

Core function is,
Normalized( x ) = (x - min) / (max - min) * scale + newBase

where newBase and scale are parameters of the VectorTransformer. newBase is 
the new minimum value for the feature, and scale controls the range after 
transformation. This is a little more complicated than basic MinMax 
normalization, yet it provides flexibility so that users can control the 
range more specifically, e.g. [0.1, 0.9] in some NN applications.

For the case that max == min, 0.5 is used as the raw value.

reference:
 http://en.wikipedia.org/wiki/Feature_scaling
http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm


  was:
Add a new scaling method to feature component, which is commonly known as 
min-max normalization or Rescaling.

Core function is,
Normalized(x) = (x - min) / (max - min) * scale + newBase

where newBase and scale are parameters of the VectorTransformer. newBase is the 
new minimum number for the feature, and scale controls the range after 
transformation. This is a little complicated than the basic MinMax 
normalization, yet it provides flexibility so that users can control the range 
more specifically. like [0.1, 0.9] in some NN application.

for case that max == min, 0.5 is used as the raw value.

reference:
 http://en.wikipedia.org/wiki/Feature_scaling
http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm



 Add MinMaxNormalizer to feature transformation
 --

 Key: SPARK-7514
 URL: https://issues.apache.org/jira/browse/SPARK-7514
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: yuhao yang
   Original Estimate: 24h
  Remaining Estimate: 24h

 Add a new scaling method to the feature component, commonly known as min-max 
 normalization or Rescaling.
 Core function is,
 Normalized( x ) = (x - min) / (max - min) * scale + newBase
 where newBase and scale are parameters of the VectorTransformer. newBase is 
 the new minimum value for the feature, and scale controls the range after 
 transformation. This is a little more complicated than basic MinMax 
 normalization, yet it provides flexibility so that users can control the 
 range more specifically, e.g. [0.1, 0.9] in some NN applications.
 For the case that max == min, 0.5 is used as the raw value.
 reference:
  http://en.wikipedia.org/wiki/Feature_scaling
 http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7496) Update Programming guide with Online LDA

2015-05-10 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537114#comment-14537114
 ] 

yuhao yang commented on SPARK-7496:
---

Hi Joseph, just something I drafted for your reference:

LDA takes in a collection of documents as vectors of word counts. It supports 
different inference algorithms via the setOptimizer function. EMLDAOptimizer 
learns clustering using expectation-maximization on the likelihood function, 
while OnlineLDAOptimizer uses iterative mini-batch sampling for online 
variational inference.

After fitting on the documents, LDA provides:

 Update Programming guide with Online LDA
 

 Key: SPARK-7496
 URL: https://issues.apache.org/jira/browse/SPARK-7496
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Reporter: Joseph K. Bradley
Priority: Minor

 Update LDA subsection of clustering section of MLlib programming guide to 
 include OnlineLDA



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7090) Introduce LDAOptimizer to LDA to further improve extensibility

2015-05-10 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang closed SPARK-7090.
-

Closing the JIRA as the code is merged. Thanks for the careful review and the 
important fix.

 Introduce LDAOptimizer to LDA to further improve extensibility
 --

 Key: SPARK-7090
 URL: https://issues.apache.org/jira/browse/SPARK-7090
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.1
Reporter: yuhao yang
Assignee: yuhao yang
 Fix For: 1.4.0

   Original Estimate: 72h
  Remaining Estimate: 72h

 LDA was implemented with extensibility in mind, and with the development of 
 OnlineLDA and Gibbs sampling we are collecting more detailed requirements 
 from different algorithms.
 As Joseph Bradley proposed in https://github.com/apache/spark/pull/4807, and 
 after some further discussion, we'd like to adjust the code structure a 
 little to present the common interface and extension point clearly.
 Basically, class LDA would be the common entry point for LDA computation, 
 and each LDA object would refer to an LDAOptimizer for the concrete 
 algorithm implementation. Users can customize the LDAOptimizer with specific 
 parameters and assign it to the LDA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7368) add QR decomposition for RowMatrix

2015-05-10 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537205#comment-14537205
 ] 

yuhao yang edited comment on SPARK-7368 at 5/10/15 2:54 PM:


Hi Zongheng, since the AMPLab version is built upon a different RowMatrix 
implementation, I'm not sure it's appropriate to make a direct comparison; I 
haven't had the time to review the differences carefully.
If possible, can you please share more of the information that has been 
collected, like benchmarks or capabilities? In the meantime, I'll do the same 
for my PR. Even better, there's a plan to migrate the AMPLab version to Spark.

For anyone with interest, your suggestions and trials will be most welcome. 
Thanks.


was (Author: yuhaoyan):
Hi Zongheng, since the Amplab version is built upon a different RowMatrix 
implementation. I'm not sure if it's appropriate to make a direct comparison. I 
haven't got the time to review the difference carefully.
If possible, can you please share more information that has been collected, 
like some benchmark or capability. In the meantime, I'll do it for my PR also. 
And what's better is that there have been a plan to migrate the Amplab version 
to Spark. Let me know if you have any suggestion about how shall we proceed. 
Thanks.

 add QR decomposition for RowMatrix
 --

 Key: SPARK-7368
 URL: https://issues.apache.org/jira/browse/SPARK-7368
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: yuhao yang
   Original Estimate: 48h
  Remaining Estimate: 48h

 Add QR decomposition for RowMatrix.
 There's a great distributed algorithm for QR decomposition that I'm 
 currently following:
 Austin R. Benson, David F. Gleich, James Demmel. Direct QR factorizations 
 for tall-and-skinny matrices in MapReduce architectures, 2013 IEEE 
 International Conference on Big Data



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7368) add QR decomposition for RowMatrix

2015-05-10 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537205#comment-14537205
 ] 

yuhao yang commented on SPARK-7368:
---

Hi Zongheng, since the AMPLab version is built upon a different RowMatrix 
implementation, I'm not sure it's appropriate to make a direct comparison; I 
haven't had the time to review the differences carefully.
If possible, can you please share more of the information that has been 
collected, like benchmarks or capabilities? In the meantime, I'll do the same 
for my PR. Even better, there has been a plan to migrate the AMPLab version 
to Spark. Let me know if you have any suggestions about how we should proceed. 
Thanks.

 add QR decomposition for RowMatrix
 --

 Key: SPARK-7368
 URL: https://issues.apache.org/jira/browse/SPARK-7368
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: yuhao yang
   Original Estimate: 48h
  Remaining Estimate: 48h

 Add QR decomposition for RowMatrix.
 There's a great distributed algorithm for QR decomposition that I'm 
 currently following:
 Austin R. Benson, David F. Gleich, James Demmel. Direct QR factorizations 
 for tall-and-skinny matrices in MapReduce architectures, 2013 IEEE 
 International Conference on Big Data



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7455) Perf test for LDA (EM/online)

2015-05-13 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543065#comment-14543065
 ] 

yuhao yang commented on SPARK-7455:
---

I'll start working on this. Any help or suggestions will be welcome.

 Perf test for LDA (EM/online)
 -

 Key: SPARK-7455
 URL: https://issues.apache.org/jira/browse/SPARK-7455
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7496) User guide update for Online LDA

2015-05-13 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang closed SPARK-7496.
-

Doc updated.
Thanks for the review.

 User guide update for Online LDA
 

 Key: SPARK-7496
 URL: https://issues.apache.org/jira/browse/SPARK-7496
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Reporter: Joseph K. Bradley
Assignee: yuhao yang
Priority: Minor
 Fix For: 1.4.0


 Update LDA subsection of clustering section of MLlib programming guide to 
 include OnlineLDA



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7455) Perf test for LDA (EM/online)

2015-05-19 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14551624#comment-14551624
 ] 

yuhao yang commented on SPARK-7455:
---

Work in progress: https://github.com/databricks/spark-perf/pull/70

 Perf test for LDA (EM/online)
 -

 Key: SPARK-7455
 URL: https://issues.apache.org/jira/browse/SPARK-7455
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: yuhao yang





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5567) Add prediction methods to LDA

2015-06-07 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576358#comment-14576358
 ] 

yuhao yang commented on SPARK-5567:
---

I guess the major consideration is proper code reuse.
I can provide a quick implementation based on the inference from 
OnlineLDAOptimizer (simply the gamma computation part), yet I'm not sure it's 
appropriate for LocalLDAModel to refer to the methods of OnlineLDAOptimizer. 
Possible solutions include: 1) a separate OnlineLDAModel, which can invoke the 
inference of OnlineLDA; 2) moving the inference method to object LDAOptimizer.


 Add prediction methods to LDA
 -

 Key: SPARK-5567
 URL: https://issues.apache.org/jira/browse/SPARK-5567
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 LDA currently supports prediction on the training set.  E.g., you can call 
 logLikelihood and topicDistributions to get that info for the training data.  
 However, it should support the same functionality for new (test) documents.
 This will require inference but should be able to use the same code, with a 
 few modifications to keep the inferred topics fixed.
 Note: The API for these methods is already in the code but is commented out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5567) Add prediction methods to LDA

2015-06-09 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578595#comment-14578595
 ] 

yuhao yang commented on SPARK-5567:
---

Hi Joseph, just to be clear: if we're using the MAP prediction you mentioned, 
does it require fold-in Gibbs sampling (and convergence) in the prediction 
process, or just a straightforward summation?

I checked the implementation in 
https://github.com/mimno/Mallet/blob/master/src/cc/mallet/topics/TopicInferencer.java#L81.
 Is it something aligned with your idea?



 Add prediction methods to LDA
 -

 Key: SPARK-5567
 URL: https://issues.apache.org/jira/browse/SPARK-5567
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 LDA currently supports prediction on the training set.  E.g., you can call 
 logLikelihood and topicDistributions to get that info for the training data.  
 However, it should support the same functionality for new (test) documents.
 This will require inference but should be able to use the same code, with a 
 few modifications to keep the inferred topics fixed.
 Note: The API for these methods is already in the code but is commented out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8169) Add StopWordsRemover as a transformer

2015-06-09 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578444#comment-14578444
 ] 

yuhao yang commented on SPARK-8169:
---

This looks useful. I'd like to give it a try if no one has started on this. I 
also think there could be more transformers for text pre-processing, like the 
text vectorization in the LDA example and a low-frequency filter.
Some rough ideas:
The default stop words will probably contain English only, yet 
StopWordsRemover should support ASCII.
Case sensitivity will be a parameter.

Let me know if I'm missing some requirement.

 Add StopWordsRemover as a transformer
 -

 Key: SPARK-8169
 URL: https://issues.apache.org/jira/browse/SPARK-8169
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng

 StopWordsRemover takes a string array column and outputs a string array 
 column with all defined stop words removed. The transformer should also come 
 with a standard set of stop words as default.
 {code}
 val stopWords = new StopWordsRemover()
   .setInputCol("words")
   .setOutputCol("cleanWords")
   .setStopWords(Array(...)) // optional
 val output = stopWords.transform(df)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7541) Check model save/load for MLlib 1.4

2015-06-03 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570527#comment-14570527
 ] 

yuhao yang commented on SPARK-7541:
---

I found no more issues.

 Check model save/load for MLlib 1.4
 ---

 Key: SPARK-7541
 URL: https://issues.apache.org/jira/browse/SPARK-7541
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: yuhao yang

 For each model which supports save/load methods, we need to verify:
 * These methods are tested in unit tests in Scala and Python (if save/load is 
 supported in Python).
 * If a model's name, data members, or constructors have changed _at all_, 
 then we likely need to support a new save/load format version.  Different 
 versions must be tested in unit tests to ensure backwards compatibility 
 (i.e., verify we can load old model formats).
 * Examples in the programming guide should include save/load when available.  
 It's important to try running each example in the guide whenever it is 
 modified (since there are no automated tests).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7983) Add require for one-based indices in loadLibSVMFile

2015-06-03 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-7983:
--
Priority: Minor  (was: Trivial)

 Add require for one-based indices in loadLibSVMFile
 ---

 Key: SPARK-7983
 URL: https://issues.apache.org/jira/browse/SPARK-7983
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: yuhao yang
Priority: Minor
   Original Estimate: 1h
  Remaining Estimate: 1h

 Add require for one-based indices in loadLibSVMFile
 Customers frequently use zero-based indices in their LIBSVM files. Spark 
 reports no warnings or errors during the subsequent computation, and this 
 usually leads to weird results for many algorithms (like GBDT).
 Add a quick check.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8531) Update ML user guide for MinMaxScaler

2015-06-22 Thread yuhao yang (JIRA)
yuhao yang created SPARK-8531:
-

 Summary: Update ML user guide for MinMaxScaler
 Key: SPARK-8531
 URL: https://issues.apache.org/jira/browse/SPARK-8531
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.5.0
Reporter: yuhao yang
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8529) Set metadata for MinMaxScaler

2015-06-22 Thread yuhao yang (JIRA)
yuhao yang created SPARK-8529:
-

 Summary: Set metadata for MinMaxScaler
 Key: SPARK-8529
 URL: https://issues.apache.org/jira/browse/SPARK-8529
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: yuhao yang
Priority: Minor


Adding this as a reminder to complete the output metadata for the MinMaxScaler 
transformer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8530) Add Python API for MinMaxScaler

2015-06-22 Thread yuhao yang (JIRA)
yuhao yang created SPARK-8530:
-

 Summary: Add Python API for MinMaxScaler
 Key: SPARK-8530
 URL: https://issues.apache.org/jira/browse/SPARK-8530
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.5.0
Reporter: yuhao yang
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8547) xgboost exploration

2015-06-22 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14597012#comment-14597012
 ] 

yuhao yang commented on SPARK-8547:
---

This is definitely useful, with many potential users.

 xgboost exploration
 ---

 Key: SPARK-8547
 URL: https://issues.apache.org/jira/browse/SPARK-8547
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Joseph K. Bradley

 There has been quite a bit of excitement around xgboost: 
 [https://github.com/dmlc/xgboost]
 It improves the parallelism of boosting by mixing boosting and bagging (where 
 bagging makes the algorithm more parallel).
 It would be worth exploring an implementation of this within MLlib (probably 
 as a new algorithm).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8555) Online Variational Inference for the Hierarchical Dirichlet Process

2015-06-23 Thread yuhao yang (JIRA)
yuhao yang created SPARK-8555:
-

 Summary: Online Variational Inference for the Hierarchical 
Dirichlet Process
 Key: SPARK-8555
 URL: https://issues.apache.org/jira/browse/SPARK-8555
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: yuhao yang
Priority: Minor


This task is created for exploration of the online HDP algorithm described in
http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf.

Major advantages of the algorithm: a single pass over the corpus, streaming 
friendliness, and automatic K (the topic number is inferred).

Currently the scope is to support online HDP for topic modeling, i.e. probably 
as an optimizer for LDA.






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8555) Online Variational Inference for the Hierarchical Dirichlet Process

2015-06-23 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-8555:
--
Issue Type: New Feature  (was: Bug)

 Online Variational Inference for the Hierarchical Dirichlet Process
 ---

 Key: SPARK-8555
 URL: https://issues.apache.org/jira/browse/SPARK-8555
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: yuhao yang
Priority: Minor

 This task is created for exploration of the online HDP algorithm described in
 http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf.
 Major advantages of the algorithm: a single pass over the corpus, streaming 
 friendliness, and automatic K (the topic number is inferred).
 Currently the scope is to support online HDP for topic modeling, i.e. probably 
 as an optimizer for LDA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8555) Online Variational Inference for the Hierarchical Dirichlet Process

2015-06-23 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14597237#comment-14597237
 ] 

yuhao yang commented on SPARK-8555:
---

A basic implementation is at https://github.com/hhbyyh/HDP; it still needs a 
lot of improvement and evaluation of performance and scalability.

 Online Variational Inference for the Hierarchical Dirichlet Process
 ---

 Key: SPARK-8555
 URL: https://issues.apache.org/jira/browse/SPARK-8555
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: yuhao yang
Priority: Minor

 This task is created for exploration of the online HDP algorithm described in
 http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf.
 Major advantages of the algorithm: a single pass over the corpus, streaming 
 friendliness, and automatic K (the topic number is inferred).
 Currently the scope is to support online HDP for topic modeling, i.e. probably 
 as an optimizer for LDA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8308) add missing save load for python doc example and tune down MatrixFactorization iterations

2015-06-11 Thread yuhao yang (JIRA)
yuhao yang created SPARK-8308:
-

 Summary: add missing save load for python doc example and tune 
down MatrixFactorization iterations
 Key: SPARK-8308
 URL: https://issues.apache.org/jira/browse/SPARK-8308
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: yuhao yang
Priority: Minor


1. Add some missing save/load calls in the Python examples: LogisticRegression, 
LinearRegression, NaiveBayes.
2. Tune down the iterations for MatrixFactorization, since the current number 
triggers a StackOverflowError. A sketch of both changes follows.
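
A hedged sketch of the two changes, shown in Scala for illustration (the ticket 
targets the Python doc examples); paths and parameter values are illustrative:

{code}
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

val ratings = sc.parallelize(Seq(
  Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)))

// Fewer iterations keep the RDD lineage shallow and avoid the stack overflow.
val model = ALS.train(ratings, 10 /* rank */, 10 /* iterations */, 0.01 /* lambda */)

// The save/load round trip the examples were missing.
model.save(sc, "target/tmp/myCollaborativeFilter")
val sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")
{code}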



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7541) Check model save/load for MLlib 1.4

2015-05-29 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14564263#comment-14564263
 ] 

yuhao yang edited comment on SPARK-7541 at 5/29/15 6:40 AM:


||model||Scala UT||python UT||changes||backwards Compatibility||
|LogisticRegressionModel|LogisticRegressionSuite|LogisticRegressionModel doctests|no public change|y|
|NaiveBayesModel|NaiveBayesSuite|NaiveBayesModel doctests|save/load 2.0|y|
|SVMModel|SVMSuite|SVMModel doctests|no public change|y|
|GaussianMixtureModel|GaussianMixtureSuite|checked|New Saveable in 1.4|New Saveable in 1.4|
|KMeansModel|KMeansSuite|KMeansModel doctests|New Saveable in 1.4|New Saveable in 1.4|
|PowerIterationClusteringModel|PowerIterationClusteringSuite|checked|New Saveable in 1.4|New Saveable in 1.4|
|Word2VecModel|Word2VecSuite|checked|New Saveable in 1.4|New Saveable in 1.4|
|MatrixFactorizationModel|MatrixFactorizationModelSuite|MatrixFactorizationModel doctests|no public change|y|
|IsotonicRegressionModel|IsotonicRegressionSuite|IsotonicRegressionModel|New Saveable in 1.4|New Saveable in 1.4|
|LassoModel|LassoSuite|LassoModel doctests|no public change|y|
|LinearRegressionModel|LinearRegressionSuite|LinearRegressionModel doctests|no public change|y|
|RidgeRegressionModel|RidgeRegressionSuite|RidgeRegressionModel doctests|no public change|y|
|DecisionTreeModel|DecisionTreeSuite|dt_model.save|no public change|y|
|RandomForestModel|RandomForestSuite|rf_model.save|no public change|y|
|GradientBoostedTreesModel|GradientBoostedTreesSuite|gbt_model.save|no public change|y|

The above contents have been checked, and no obvious issues were detected.
Joseph, do you think we should add save/load wherever available in the example 
documents?


was (Author: yuhaoyan):
||model||Scala UT||python UT||changes||backwards Compatibility||
|LogisticRegressionModel|LogisticRegressionSuite|LogisticRegressionModel doctests|no public change|y|
|NaiveBayesModel|NaiveBayesSuite|NaiveBayesModel doctests|save/load 2.0|y|
|SVMModel|SVMSuite|SVMModel doctests|no public change|y|
|GaussianMixtureModel|GaussianMixtureSuite|checked|New Savable in 1.4|New Savable in 1.4|
|KMeansModel|KMeansSuite|KMeansModel doctests|New Savable in 1.4|New Savable in 1.4|
|PowerIterationClusteringModel|PowerIterationClusteringSuite|checked|New Savable in 1.4|New Savable in 1.4|
|Word2VecModel|Word2VecSuite|checked|New Savable in 1.4|New Savable in 1.4|
|MatrixFactorizationModel|MatrixFactorizationModelSuite|MatrixFactorizationModel doctests|no public change|y|
|IsotonicRegressionModel|IsotonicRegressionSuite|IsotonicRegressionModel|New Savable in 1.4|New Savable in 1.4|
|LassoModel|LassoSuite|LassoModel doctests|no public change|y|
|LinearRegressionModel|LinearRegressionSuite|LinearRegressionModel doctests|no public change|y|
|RidgeRegressionModel|RidgeRegressionSuite|RidgeRegressionModel doctests|no public change|y|
|DecisionTreeModel|DecisionTreeSuite|dt_model.save|no public change|y|
|RandomForestModel|RandomForestSuite|rf_model.save|no public change|y|
|GradientBoostedTreesModel|GradientBoostedTreesSuite|gbt_model.sav |

[jira] [Comment Edited] (SPARK-7541) Check model save/load for MLlib 1.4

2015-05-29 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14564296#comment-14564296
 ] 

yuhao yang edited comment on SPARK-7541 at 5/29/15 7:14 AM:


Oh, "checked" means I found no Python support for save/load for the model. I 
guess we can add it in 1.5.


was (Author: yuhaoyan):
Oh, "checked" means I found no Python support for save/load for the model.

 Check model save/load for MLlib 1.4
 ---

 Key: SPARK-7541
 URL: https://issues.apache.org/jira/browse/SPARK-7541
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: yuhao yang

 For each model which supports save/load methods, we need to verify:
 * These methods are tested in unit tests in Scala and Python (if save/load is 
 supported in Python).
 * If a model's name, data members, or constructors have changed _at all_, 
 then we likely need to support a new save/load format version.  Different 
 versions must be tested in unit tests to ensure backwards compatibility 
 (i.e., verify we can load old model formats).
 * Examples in the programming guide should include save/load when available.  
 It's important to try running each example in the guide whenever it is 
 modified (since there are no automated tests).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7541) Check model save/load for MLlib 1.4

2015-05-29 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14564263#comment-14564263
 ] 

yuhao yang commented on SPARK-7541:
---

||model||Scala UT||python UT||changes||backwards Compatibility||
|LogisticRegressionModel|LogisticRegressionSuite|LogisticRegressionModel doctests|no public change|y|
|NaiveBayesModel|NaiveBayesSuite|NaiveBayesModel doctests|save/load 2.0|y|
|SVMModel|SVMSuite|SVMModel doctests|no public change|y|
|GaussianMixtureModel|GaussianMixtureSuite|checked|New Savable in 1.4|New Savable in 1.4|
|KMeansModel|KMeansSuite|KMeansModel doctests|New Savable in 1.4|New Savable in 1.4|
|PowerIterationClusteringModel|PowerIterationClusteringSuite|checked|New Savable in 1.4|New Savable in 1.4|
|Word2VecModel|Word2VecSuite|checked|New Savable in 1.4|New Savable in 1.4|
|MatrixFactorizationModel|MatrixFactorizationModelSuite|MatrixFactorizationModel doctests|no public change|y|
|IsotonicRegressionModel|IsotonicRegressionSuite|IsotonicRegressionModel|New Savable in 1.4|New Savable in 1.4|
|LassoModel|LassoSuite|LassoModel doctests|no public change|y|
|LinearRegressionModel|LinearRegressionSuite|LinearRegressionModel doctests|no public change|y|
|RidgeRegressionModel|RidgeRegressionSuite|RidgeRegressionModel doctests|no public change|y|
|DecisionTreeModel|DecisionTreeSuite|dt_model.save|no public change|y|
|RandomForestModel|RandomForestSuite|rf_model.save|no public change|y|
|GradientBoostedTreesModel|GradientBoostedTreesSuite|gbt_model.save|no public change|y|

The above contents have been checked, and no obvious issues were detected.
Joseph, do you think we should add save/load wherever available in the example 
documents?

 Check model save/load for MLlib 1.4
 ---

 Key: SPARK-7541
 URL: https://issues.apache.org/jira/browse/SPARK-7541
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: yuhao yang

 For each model which supports save/load methods, we need to verify:
 * These methods are tested in unit tests in Scala and Python (if save/load is 
 supported in Python).
 * If a model's name, data members, or constructors have changed _at all_, 
 then we likely need to support a new save/load format version.  Different 
 versions must be tested in unit tests to ensure backwards compatibility 
 (i.e., verify we can load old model formats).
 * Examples in the programming guide should include save/load when available.  
 It's important to try running each example in the guide whenever it is 
 modified (since there are no automated tests).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7541) Check model save/load for MLlib 1.4

2015-05-29 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14564296#comment-14564296
 ] 

yuhao yang commented on SPARK-7541:
---

Oh, "checked" means I found no Python support for save/load for the model.

 Check model save/load for MLlib 1.4
 ---

 Key: SPARK-7541
 URL: https://issues.apache.org/jira/browse/SPARK-7541
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: yuhao yang

 For each model which supports save/load methods, we need to verify:
 * These methods are tested in unit tests in Scala and Python (if save/load is 
 supported in Python).
 * If a model's name, data members, or constructors have changed _at all_, 
 then we likely need to support a new save/load format version.  Different 
 versions must be tested in unit tests to ensure backwards compatibility 
 (i.e., verify we can load old model formats).
 * Examples in the programming guide should include save/load when available.  
 It's important to try running each example in the guide whenever it is 
 modified (since there are no automated tests).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7949) update document with some missing save/load

2015-05-29 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-7949:
--
Description: 
As part of SPARK-7541,
add save/load to the examples for:
KMeansModel
PowerIterationClusteringModel
Word2VecModel
IsotonicRegressionModel


  was:
add save/load to the examples for:
KMeansModel
PowerIterationClusteringModel
Word2VecModel
IsotonicRegressionModel



 update document with some missing save/load
 ---

 Key: SPARK-7949
 URL: https://issues.apache.org/jira/browse/SPARK-7949
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: yuhao yang
Priority: Minor

 As part of SPARK-7541,
 add save/load to the examples for:
 KMeansModel
 PowerIterationClusteringModel
 Word2VecModel
 IsotonicRegressionModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7949) update document with some missing save/load

2015-05-29 Thread yuhao yang (JIRA)
yuhao yang created SPARK-7949:
-

 Summary: update document with some missing save/load
 Key: SPARK-7949
 URL: https://issues.apache.org/jira/browse/SPARK-7949
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: yuhao yang
Priority: Minor


add save/load to the examples for:
KMeansModel
PowerIterationClusteringModel
Word2VecModel
IsotonicRegressionModel
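
A minimal sketch of the snippet being added, assuming a trained model named 
clusters as in the existing guide example (KMeansModel shown; the other three 
models follow the same pattern, and the path is illustrative):

{code}
import org.apache.spark.mllib.clustering.KMeansModel

// Round-trip the trained model through the Saveable API.
clusters.save(sc, "myKMeansModelPath")
val sameModel = KMeansModel.load(sc, "myKMeansModelPath")
{code}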




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7983) Add require for one-based indices in loadLibSVMFile

2015-05-31 Thread yuhao yang (JIRA)
yuhao yang created SPARK-7983:
-

 Summary: Add require for one-based indices in loadLibSVMFile
 Key: SPARK-7983
 URL: https://issues.apache.org/jira/browse/SPARK-7983
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: yuhao yang
Priority: Trivial


Add require for one-based indices in loadLibSVMFile

Customers frequently use zero-based indices in their LIBSVM files. Spark 
reports no warnings or errors during the subsequent computation, and this 
usually leads to weird results for many algorithms (like GBDT).

Add a quick check, as sketched below.
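
A hedged sketch of the check inside the loadLibSVMFile line parser (variable 
names are illustrative, not the exact MLUtils internals):

{code}
// Fail fast on a zero-based (or negative) index instead of computing silently.
def parseFeature(item: String): (Int, Double) = {
  val Array(indexStr, valueStr) = item.split(':')
  val index = indexStr.toInt
  require(index > 0,
    s"LIBSVM indices must be one-based, but found index $index in the input.")
  (index - 1, valueStr.toDouble)  // convert to zero-based internally
}
{code}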



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7541) Check model save/load for MLlib 1.4

2015-06-02 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568998#comment-14568998
 ] 

yuhao yang commented on SPARK-7541:
---

Oh, I haven't checked through all the examples in the markdown documents (and 
I think it's necessary).
The previous JIRA, SPARK-7949, just added some missing save/load.

 Check model save/load for MLlib 1.4
 ---

 Key: SPARK-7541
 URL: https://issues.apache.org/jira/browse/SPARK-7541
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: yuhao yang

 For each model which supports save/load methods, we need to verify:
 * These methods are tested in unit tests in Scala and Python (if save/load is 
 supported in Python).
 * If a model's name, data members, or constructors have changed _at all_, 
 then we likely need to support a new save/load format version.  Different 
 versions must be tested in unit tests to ensure backwards compatibility 
 (i.e., verify we can load old model formats).
 * Examples in the programming guide should include save/load when available.  
 It's important to try running each example in the guide whenever it is 
 modified (since there are no automated tests).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7541) Check model save/load for MLlib 1.4

2015-06-02 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568998#comment-14568998
 ] 

yuhao yang edited comment on SPARK-7541 at 6/2/15 11:53 AM:


Oh, thanks. Yet I haven't checked through all the examples in the markdown 
documents (and I think it's necessary).
The previous JIRA, SPARK-7949, just added some missing save/load.


was (Author: yuhaoyan):
Oh, I haven't checked through all the examples in the markdown documents (and 
I think it's necessary).
The previous JIRA, SPARK-7949, just added some missing save/load.

 Check model save/load for MLlib 1.4
 ---

 Key: SPARK-7541
 URL: https://issues.apache.org/jira/browse/SPARK-7541
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: yuhao yang

 For each model which supports save/load methods, we need to verify:
 * These methods are tested in unit tests in Scala and Python (if save/load is 
 supported in Python).
 * If a model's name, data members, or constructors have changed _at all_, 
 then we likely need to support a new save/load format version.  Different 
 versions must be tested in unit tests to ensure backwards compatibility 
 (i.e., verify we can load old model formats).
 * Examples in the programming guide should include save/load when available.  
 It's important to try running each example in the guide whenever it is 
 modified (since there are no automated tests).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7541) Check model save/load for MLlib 1.4

2015-06-02 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568998#comment-14568998
 ] 

yuhao yang edited comment on SPARK-7541 at 6/2/15 11:56 AM:


Oh, thanks. Yet I haven't checked through all the examples with save/load in 
the markdown documents (and I think it's necessary).
The previous JIRA, SPARK-7949, just added some missing save/load.


was (Author: yuhaoyan):
Oh, thanks. Yet I haven't checked through all the examples in the markdown 
documents (and I think it's necessary).
The previous JIRA, SPARK-7949, just added some missing save/load.

 Check model save/load for MLlib 1.4
 ---

 Key: SPARK-7541
 URL: https://issues.apache.org/jira/browse/SPARK-7541
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: yuhao yang

 For each model which supports save/load methods, we need to verify:
 * These methods are tested in unit tests in Scala and Python (if save/load is 
 supported in Python).
 * If a model's name, data members, or constructors have changed _at all_, 
 then we likely need to support a new save/load format version.  Different 
 versions must be tested in unit tests to ensure backwards compatibility 
 (i.e., verify we can load old model formats).
 * Examples in the programming guide should include save/load when available.  
 It's important to try running each example in the guide whenever it is 
 modified (since there are no automated tests).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8043) update NaiveBayes and SVM examples in doc

2015-06-02 Thread yuhao yang (JIRA)
yuhao yang created SPARK-8043:
-

 Summary: update NaiveBayes and SVM examples in doc
 Key: SPARK-8043
 URL: https://issues.apache.org/jira/browse/SPARK-8043
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: yuhao yang
Priority: Minor


I found some issues while testing the save/load examples in the markdown 
documents, as part of the 1.4 QA plan.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7949) update document with some missing save/load

2015-06-01 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568339#comment-14568339
 ] 

yuhao yang commented on SPARK-7949:
---

Oh thanks, I thought we should close the JIRA when the code work is done. 
Shall I reopen it?

 update document with some missing save/load
 ---

 Key: SPARK-7949
 URL: https://issues.apache.org/jira/browse/SPARK-7949
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: yuhao yang
Assignee: yuhao yang
Priority: Minor
 Fix For: 1.4.0


 As part of SPARK-7541,
 add save/load to the examples for:
 KMeansModel
 PowerIterationClusteringModel
 Word2VecModel
 IsotonicRegressionModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7949) update document with some missing save/load

2015-06-01 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang closed SPARK-7949.
-

 update document with some missing save/load
 ---

 Key: SPARK-7949
 URL: https://issues.apache.org/jira/browse/SPARK-7949
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: yuhao yang
Assignee: yuhao yang
Priority: Minor
 Fix For: 1.4.0


 As part of SPARK-7541,
 add save/load to the examples for:
 KMeansModel
 PowerIterationClusteringModel
 Word2VecModel
 IsotonicRegressionModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8744) StringIndexerModel should have public constructor

2015-07-01 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14610558#comment-14610558
 ] 

yuhao yang commented on SPARK-8744:
---

Just a reminder:
There seems to be more work to do than simply changing the access modifiers, 
since passed-in labels have a larger chance of triggering the unseen label 
exception. Perhaps we should address the exception first.
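
A sketch of the kind of constructor this JIRA proposes (the signature is an 
assumption, not a final API), with the caveat above about unseen labels:

{code}
import org.apache.spark.ml.feature.StringIndexerModel

// Build the model directly from a pre-computed label list.
val model = new StringIndexerModel(Array("a", "b", "c"))
  .setInputCol("category")
  .setOutputCol("categoryIndex")
// model.transform(df) would currently throw on any value outside the labels
{code}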

 StringIndexerModel should have public constructor
 -

 Key: SPARK-8744
 URL: https://issues.apache.org/jira/browse/SPARK-8744
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Trivial
  Labels: starter
   Original Estimate: 48h
  Remaining Estimate: 48h

 It would be helpful to allow users to pass a pre-computed index to create an 
 indexer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8744) StringIndexerModel should have public constructor

2015-07-01 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14610558#comment-14610558
 ] 

yuhao yang edited comment on SPARK-8744 at 7/1/15 4:10 PM:
---

There seems to be more work than simply changing the access modifiers, since 
passed-in labels have a larger chance of triggering the unseen label exception. 
Perhaps we should address the exception first.


was (Author: yuhaoyan):
Just a reminder:
There seems to be more work to do than simply changing the access modifiers, 
since passed-in labels have a larger chance of triggering the unseen label 
exception. Perhaps we should address the exception first.

 StringIndexerModel should have public constructor
 -

 Key: SPARK-8744
 URL: https://issues.apache.org/jira/browse/SPARK-8744
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Trivial
  Labels: starter
   Original Estimate: 48h
  Remaining Estimate: 48h

 It would be helpful to allow users to pass a pre-computed index to create an 
 indexer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8703) Add CountVectorizer as a ml transformer to convert document to words count vector

2015-07-01 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609703#comment-14609703
 ] 

yuhao yang commented on SPARK-8703:
---

Thanks Joseph. 

It's true that CountVectorizer and HashingTF share similar input and output, 
yet currently CountVectorizer does not actually inherit anything useful from 
HashingTF, and I rather like the current clean separation among the feature 
transformers. I'm inclined to undo the extension.

As for code reuse, given that HashingTF invokes the mllib version and is a 
quite straightforward implementation, it may not be necessary to refactor 
for reuse.

[~viirya] and [~fliang], thanks for your opinions; I'd like to know your 
thoughts on this.

 Add CountVectorizer as a ml transformer to convert document to words count 
 vector
 -

 Key: SPARK-8703
 URL: https://issues.apache.org/jira/browse/SPARK-8703
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: yuhao yang
   Original Estimate: 24h
  Remaining Estimate: 24h

 Converts a text document to a sparse vector of token counts. Similar to 
 http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
 I can further add an estimator to extract the vocabulary from the corpus if 
 that's appropriate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8703) Add CountVectorizer as a ml transformer to convert document to words count vector

2015-06-29 Thread yuhao yang (JIRA)
yuhao yang created SPARK-8703:
-

 Summary: Add CountVectorizer as a ml transformer to convert 
document to words count vector
 Key: SPARK-8703
 URL: https://issues.apache.org/jira/browse/SPARK-8703
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: yuhao yang


Converts a text document to a sparse vector of token counts.

I can further add an estimator to extract the vocabulary from the corpus if 
that's appropriate. A usage sketch follows.
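
A minimal usage sketch of the proposed transformer, modeled on the scikit-learn 
CountVectorizer (the API surface here, setInputCol/setOutputCol/setVocabSize 
and an estimator-style fit, is an assumption, not a settled design):

{code}
import org.apache.spark.ml.feature.CountVectorizer

val df = sqlContext.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)              // cap the extracted vocabulary size
  .fit(df)                      // learns the vocabulary from the corpus

cvModel.transform(df).show()    // sparse token-count vector per document
{code}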





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


