[jira] [Commented] (SPARK-4348) pyspark.mllib.random conflicts with random module

2014-11-11 Thread Doris Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207199#comment-14207199
 ] 

Doris Xin commented on SPARK-4348:
--

I fully support this. It took a lot of hacking just to override the default 
random module in Python, and it wasn't clear if the override was the ideal 
solution.

 pyspark.mllib.random conflicts with random module
 -

 Key: SPARK-4348
 URL: https://issues.apache.org/jira/browse/SPARK-4348
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.1.0, 1.2.0
Reporter: Davies Liu
Priority: Blocker

 There are conflicts in two cases:
 1. The random module is used by pyspark.mllib.feature. If the first entry of 
 sys.path is not '', the hack in pyspark/__init__.py fails to fix the 
 conflict.
 2. When running the tests in mllib/xxx.py, '' must be popped from sys.path 
 before anything is imported, or the tests will fail.
 The first case is not fully fixed for users; it causes problems in some 
 cases, such as:
 {code}
  import sys
  sys.path.insert(0, PATH_OF_MODULE)
  import pyspark
  # using Word2Vec will fail
 {code}
 I'd like to rename mllib/random.py to mllib/_random.py, then in 
 mllib/__init__.py:
 {code}
 import pyspark.mllib._random as random
 {code}
 cc [~mengxr] [~dorx]
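
 The shadowing described above can be reproduced outside Spark. The following 
 is a minimal sketch, not PySpark code: fake_mllib_dir and resolve_origin are 
 hypothetical names, and the temporary directory only stands in for a package 
 directory that contains its own random.py. It asks the import machinery which 
 file a given search path would resolve "random" to.

```python
import importlib.machinery
import os
import sys
import tempfile


def resolve_origin(module_name, search_path):
    """Return the file that search_path would resolve module_name to."""
    spec = importlib.machinery.PathFinder.find_spec(module_name, search_path)
    return spec.origin if spec else None


# Hypothetical stand-in for a package directory that ships its own random.py.
fake_mllib_dir = tempfile.mkdtemp()
shadow_file = os.path.join(fake_mllib_dir, "random.py")
with open(shadow_file, "w") as f:
    f.write("# local module that shadows the standard library\n")

# A directory at the front of the search path wins over the standard library.
shadowed = resolve_origin("random", [fake_mllib_dir] + list(sys.path))

# Without that directory, the standard library module is found instead.
stdlib = resolve_origin("random", list(sys.path))

print(shadowed)
print(stdlib)
```

 Giving the module a name that cannot collide, as the rename to _random.py 
 proposes, makes the result independent of the order of sys.path.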



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3077) ChiSqTest bugs

2014-08-15 Thread Doris Xin (JIRA)
Doris Xin created SPARK-3077:


 Summary: ChiSqTest bugs
 Key: SPARK-3077
 URL: https://issues.apache.org/jira/browse/SPARK-3077
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Doris Xin


- promote nullHypothesis field in ChiSqTestResult to TestResult. Every test 
should have a null hypothesis
- Correct null hypothesis statement for independence test
- line 59 in TestResult: 0.05 - 0.5



--
This message was sent by Atlassian JIRA
(v6.2#6252)




[jira] [Created] (SPARK-2993) colStats in Statistics as wrapper around MultivariateStatisticalSummary in Scala and Python

2014-08-12 Thread Doris Xin (JIRA)
Doris Xin created SPARK-2993:


 Summary: colStats in Statistics as wrapper around 
MultivariateStatisticalSummary in Scala and Python
 Key: SPARK-2993
 URL: https://issues.apache.org/jira/browse/SPARK-2993
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Doris Xin









[jira] [Updated] (SPARK-2515) Chi-squared test

2014-08-11 Thread Doris Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doris Xin updated SPARK-2515:
-

Summary: Chi-squared test  (was: Hypothesis testing)

 Chi-squared test
 

 Key: SPARK-2515
 URL: https://issues.apache.org/jira/browse/SPARK-2515
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Doris Xin
 Fix For: 1.1.0


 Support common statistical tests in Spark MLlib.






[jira] [Created] (SPARK-2980) Python support for chi-squared test

2014-08-11 Thread Doris Xin (JIRA)
Doris Xin created SPARK-2980:


 Summary: Python support for chi-squared test
 Key: SPARK-2980
 URL: https://issues.apache.org/jira/browse/SPARK-2980
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Doris Xin









[jira] [Created] (SPARK-2937) Separate out sampleByKeyExact in PairRDDFunctions as its own API

2014-08-09 Thread Doris Xin (JIRA)
Doris Xin created SPARK-2937:


 Summary: Separate out sampleByKeyExact in PairRDDFunctions as its 
own API
 Key: SPARK-2937
 URL: https://issues.apache.org/jira/browse/SPARK-2937
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Doris Xin









[jira] [Resolved] (SPARK-2851) Check API consistency for decision tree

2014-08-08 Thread Doris Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doris Xin resolved SPARK-2851.
--

Resolution: Done

 Check API consistency for decision tree
 ---

 Key: SPARK-2851
 URL: https://issues.apache.org/jira/browse/SPARK-2851
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Doris Xin
 Fix For: 1.1.0


 Decision tree API consistency across Python/Java/Scala. We might want to add 
 a new constructor to Scala's decision tree to match Python's.






[jira] [Reopened] (SPARK-2851) Check API consistency for decision tree

2014-08-07 Thread Doris Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doris Xin reopened SPARK-2851:
--


Shouldn't have been auto-closed with the PR.

 Check API consistency for decision tree
 ---

 Key: SPARK-2851
 URL: https://issues.apache.org/jira/browse/SPARK-2851
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Doris Xin
 Fix For: 1.1.0


 Decision tree API consistency across Python/Java/Scala. We might want to add 
 a new constructor to Scala's decision tree to match Python's.






[jira] [Created] (SPARK-2786) Python correlations

2014-08-01 Thread Doris Xin (JIRA)
Doris Xin created SPARK-2786:


 Summary: Python correlations
 Key: SPARK-2786
 URL: https://issues.apache.org/jira/browse/SPARK-2786
 Project: Spark
  Issue Type: Sub-task
Reporter: Doris Xin








[jira] [Created] (SPARK-2782) Spearman correlation computes wrong ranks when numPartitions > RDD size

2014-07-31 Thread Doris Xin (JIRA)
Doris Xin created SPARK-2782:


 Summary: Spearman correlation computes wrong ranks when 
numPartitions > RDD size
 Key: SPARK-2782
 URL: https://issues.apache.org/jira/browse/SPARK-2782
 Project: Spark
  Issue Type: Bug
Reporter: Doris Xin


The getRanks logic inside SpearmanCorrelation returns the wrong ranks when 
numPartitions > size for the input RDDs.
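
A single-machine reference is one way to check a distributed rank computation, 
because its output cannot depend on how the data is partitioned. The sketch 
below is only an illustration: average_ranks is a hypothetical helper, not 
MLlib's getRanks, and averaging the ranks of tied values is an assumption 
(it is the usual convention for Spearman correlation).

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value ties with the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


ranks = average_ranks([10.0, 20.0, 20.0, 5.0])
print(ranks)
```

Comparing such a reference against the distributed result for several 
partition counts, including counts larger than the input, would exercise the 
case reported here.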





[jira] [Reopened] (SPARK-2512) Stratified sampling

2014-07-28 Thread Doris Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doris Xin reopened SPARK-2512:
--


 Stratified sampling
 ---

 Key: SPARK-2512
 URL: https://issues.apache.org/jira/browse/SPARK-2512
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Doris Xin

 PR: https://github.com/apache/spark/pull/1025





[jira] [Created] (SPARK-2724) Python version of Random RDD without support for arbitrary distribution

2014-07-28 Thread Doris Xin (JIRA)
Doris Xin created SPARK-2724:


 Summary: Python version of Random RDD without support for 
arbitrary distribution
 Key: SPARK-2724
 URL: https://issues.apache.org/jira/browse/SPARK-2724
 Project: Spark
  Issue Type: Sub-task
Reporter: Doris Xin


Python version of [SPARK-2514] but without support for randomRDD and 
randomVectorRDD, which take in any DistributionGenerator objects.





[jira] [Commented] (SPARK-2515) Hypothesis testing

2014-07-25 Thread Doris Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074802#comment-14074802
 ] 

Doris Xin commented on SPARK-2515:
--

A toString method sounds like a really good idea here actually. I think 
originally we planned the Summary object to hold anything that isn't standard 
across tests, and in the case of chi squared, I can't think of anything else to 
put in there. Having the toString method instead would allow us to have a 
single TestResult class across tests, too.

Sure, we can go with degreesOfFreedom. 

 Hypothesis testing
 --

 Key: SPARK-2515
 URL: https://issues.apache.org/jira/browse/SPARK-2515
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Doris Xin

 Support common statistical tests in Spark MLlib.





[jira] [Created] (SPARK-2679) Ser/De for Double to enable calling Java API from python in MLlib

2014-07-24 Thread Doris Xin (JIRA)
Doris Xin created SPARK-2679:


 Summary: Ser/De for Double to enable calling Java API from python 
in MLlib
 Key: SPARK-2679
 URL: https://issues.apache.org/jira/browse/SPARK-2679
 Project: Spark
  Issue Type: Sub-task
Reporter: Doris Xin


In order to enable Java/Scala APIs to be reused in the Python implementation of 
RandomRDD and Correlations, we need a set of ser/de for the type Double in 
_common.py and PythonMLLibAPI.





[jira] [Commented] (SPARK-2515) Hypothesis testing

2014-07-24 Thread Doris Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073879#comment-14073879
 ] 

Doris Xin commented on SPARK-2515:
--

Here's the proposed API for chi-squared tests (lives in 
org.apache.spark.mllib.stat.Statistics):

{code}
def chiSquare(X: RDD[Vector], method: String = "pearson"): ChiSquareTestResult
def chiSquare(x: RDD[Double], y: RDD[Double], method: String = "pearson"): ChiSquareTestResult
{code}

where ChiSquareTestResult : TestResult looks like:

{code}
pValue: Double
df: Array[Int] // normally a single value, but more are needed for ANOVA
statistic: Double
ChiSquareSummary : Summary
{code}

So a couple points of discussion:
1. Of the many variants of the chi-squared test, what methods in addition to 
pearson do we want to support (hopefully based on popular demand)? 
http://en.wikipedia.org/wiki/Chi-squared_test
2. What special fields should ChiSquareSummary have?
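
For reference, the quantities the pearson method would report can be sketched 
on a single machine. This is a hedged illustration of Pearson's test of 
independence on a contingency table, not the proposed MLlib API: 
pearson_chi_squared is a hypothetical name, every row and column total is 
assumed to be nonzero, and the p-value is omitted because it needs a 
chi-squared CDF.

```python
def pearson_chi_squared(observed):
    """Return (statistic, degrees_of_freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = float(sum(row_totals))
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, degrees_of_freedom


statistic, df = pearson_chi_squared([[10, 20], [20, 10]])
print(statistic, df)
```

Here every expected count is 15, so the statistic is 4 * 25 / 15 with one 
degree of freedom.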

 Hypothesis testing
 --

 Key: SPARK-2515
 URL: https://issues.apache.org/jira/browse/SPARK-2515
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Doris Xin

 Support common statistical tests in Spark MLlib.





[jira] [Created] (SPARK-2656) Python version without support for exact sample size

2014-07-23 Thread Doris Xin (JIRA)
Doris Xin created SPARK-2656:


 Summary: Python version without support for exact sample size
 Key: SPARK-2656
 URL: https://issues.apache.org/jira/browse/SPARK-2656
 Project: Spark
  Issue Type: Sub-task
Reporter: Doris Xin








[jira] [Closed] (SPARK-2599) almostEquals mllib.util.TestingUtils does not behave as expected when comparing against 0.0

2014-07-21 Thread Doris Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doris Xin closed SPARK-2599.


Resolution: Duplicate

Refer to this issue: https://issues.apache.org/jira/browse/SPARK-2479

 almostEquals mllib.util.TestingUtils does not behave as expected when 
 comparing against 0.0
 ---

 Key: SPARK-2599
 URL: https://issues.apache.org/jira/browse/SPARK-2599
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Doris Xin
Priority: Minor

 DoubleWithAlmostEquals.almostEquals, when used to compare a number with 0.0, 
 would always produce an epsilon of 1 > 1e-10, causing false failure when 
 comparing very small numbers with 0.0.





[jira] [Created] (SPARK-2599) almostEquals mllib.util.TestingUtils does not behave as expected when comparing against 0.0

2014-07-20 Thread Doris Xin (JIRA)
Doris Xin created SPARK-2599:


 Summary: almostEquals mllib.util.TestingUtils does not behave as 
expected when comparing against 0.0
 Key: SPARK-2599
 URL: https://issues.apache.org/jira/browse/SPARK-2599
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Doris Xin
Priority: Minor


DoubleWithAlmostEquals.almostEquals, when used to compare a number with 0.0, 
would always produce an epsilon of 1 > 1e-10, causing false failure when 
comparing very small numbers with 0.0.
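
The failure can be reproduced with a purely relative comparison. The sketch 
below is an assumption about the behavior described, not the TestingUtils 
source: relative_only_almost_equals is a hypothetical name. When one operand 
is 0.0, the relative difference is |x - 0| / |x| = 1 for any nonzero x, 
however small x is.

```python
def relative_only_almost_equals(x, y, epsilon=1e-10):
    """Compare using only the relative difference (illustrative, not Spark code)."""
    if x == y:
        return True
    # Normalize by the larger magnitude; this equals 1.0 whenever one side is 0.0.
    return abs(x - y) / max(abs(x), abs(y)) < epsilon


tiny_vs_zero = relative_only_almost_equals(1e-15, 0.0)
close_near_one = relative_only_almost_equals(1.0, 1.0 + 1e-12)
print(tiny_vs_zero, close_near_one)
```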





[jira] [Commented] (SPARK-2512) Stratified sampling

2014-07-20 Thread Doris Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067848#comment-14067848
 ] 

Doris Xin commented on SPARK-2512:
--

Hey Xiangrui can you close this one since there's already another JIRA in place 
for this? https://issues.apache.org/jira/browse/SPARK-2082

Thanks.

 Stratified sampling
 ---

 Key: SPARK-2512
 URL: https://issues.apache.org/jira/browse/SPARK-2512
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Doris Xin

 PR: https://github.com/apache/spark/pull/1025





[jira] [Closed] (SPARK-2600) Correlations (Pearson, Spearman)

2014-07-20 Thread Doris Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doris Xin closed SPARK-2600.


Resolution: Implemented

 Correlations (Pearson, Spearman)
 

 Key: SPARK-2600
 URL: https://issues.apache.org/jira/browse/SPARK-2600
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Doris Xin
Assignee: Doris Xin







[jira] [Updated] (SPARK-2082) Stratified sampling implementation in PairRDDFunctions

2014-07-20 Thread Doris Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doris Xin updated SPARK-2082:
-

Target Version/s: 1.1.0

 Stratified sampling implementation in PairRDDFunctions
 --

 Key: SPARK-2082
 URL: https://issues.apache.org/jira/browse/SPARK-2082
 Project: Spark
  Issue Type: New Feature
Reporter: Doris Xin
Assignee: Doris Xin

 Implementation of stratified sampling that guarantees exact sample size = 
 sum(math.ceil(S_i*samplingRate)) where S_i is the size of each stratum. 





[jira] [Closed] (SPARK-2512) Stratified sampling

2014-07-20 Thread Doris Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doris Xin closed SPARK-2512.


Resolution: Duplicate

 Stratified sampling
 ---

 Key: SPARK-2512
 URL: https://issues.apache.org/jira/browse/SPARK-2512
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Doris Xin

 PR: https://github.com/apache/spark/pull/1025





[jira] [Commented] (SPARK-2599) almostEquals mllib.util.TestingUtils does not behave as expected when comparing against 0.0

2014-07-20 Thread Doris Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068113#comment-14068113
 ] 

Doris Xin commented on SPARK-2599:
--

Found this in-depth article discussing the different considerations for 
comparing floating point numbers: 
http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm

My suggestion is the following (a blend of absolute and relative epsilon):

def almostEquals(y: Double, epsilon: Double = 1E-10): Boolean = {
  if (x == y || math.abs(x - y) < epsilon) {
    true
  } else if (math.abs(x) > math.abs(y)) {
    math.abs(x - y) / math.abs(x) < epsilon
  } else {
    math.abs(x - y) / math.abs(y) < epsilon
  }
}

Not the most rigorous but covers most use cases I'd imagine (small numbers get 
caught by the first condition while large numbers with large absolute 
difference but small relative difference would still be considered equal by the 
subsequent conditions).
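
The same blend of absolute and relative epsilon can be tried outside the test 
suite. This is a hedged Python rendering of the suggestion, taking both 
operands as arguments; almost_equals is a hypothetical name and the sample 
values are chosen only to exercise each branch.

```python
def almost_equals(x, y, epsilon=1e-10):
    """Blend of absolute and relative comparison, following the suggestion above."""
    diff = abs(x - y)
    # The absolute check catches small numbers, including comparisons with 0.0.
    if x == y or diff < epsilon:
        return True
    # The relative check lets large numbers differ by a large absolute amount.
    return diff / max(abs(x), abs(y)) < epsilon


small_vs_zero = almost_equals(1e-15, 0.0)
large_close = almost_equals(1e12, 1e12 + 1)
clearly_different = almost_equals(1.0, 1.1)
print(small_vs_zero, large_close, clearly_different)
```

One consequence of the absolute check is that any two values below epsilon in 
magnitude compare as equal, so epsilon has to be chosen with the scale of the 
data in mind.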

 almostEquals mllib.util.TestingUtils does not behave as expected when 
 comparing against 0.0
 ---

 Key: SPARK-2599
 URL: https://issues.apache.org/jira/browse/SPARK-2599
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Doris Xin
Priority: Minor

 DoubleWithAlmostEquals.almostEquals, when used to compare a number with 0.0, 
 would always produce an epsilon of 1 > 1e-10, causing false failure when 
 comparing very small numbers with 0.0.





[jira] [Comment Edited] (SPARK-2599) almostEquals mllib.util.TestingUtils does not behave as expected when comparing against 0.0

2014-07-20 Thread Doris Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068113#comment-14068113
 ] 

Doris Xin edited comment on SPARK-2599 at 7/21/14 2:06 AM:
---

Found this in-depth article discussing the different considerations for 
comparing floating point numbers: 
http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm

My suggestion is the following (a blend of absolute and relative epsilon):

{code}
def almostEquals(y: Double, epsilon: Double = 1E-10): Boolean = {
  if (x == y || math.abs(x - y) < epsilon) {
    true
  } else if (math.abs(x) > math.abs(y)) {
    math.abs(x - y) / math.abs(x) < epsilon
  } else {
    math.abs(x - y) / math.abs(y) < epsilon
  }
}
{code}

Not the most rigorous but covers most use cases I'd imagine (small numbers get 
caught by the first condition while large numbers with large absolute 
difference but small relative difference would still be considered equal by the 
subsequent conditions).


was (Author: dorx):
Found this in-depth article discussing the different considerations for 
comparing floating point numbers: 
http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm

My suggestion is the following (a blend of absolute and relative epsilon):

def almostEquals(y: Double, epsilon: Double = 1E-10): Boolean = {
  if (x == y || math.abs(x - y) < epsilon) {
    true
  } else if (math.abs(x) > math.abs(y)) {
    math.abs(x - y) / math.abs(x) < epsilon
  } else {
    math.abs(x - y) / math.abs(y) < epsilon
  }
}

Not the most rigorous but covers most use cases I'd imagine (small numbers get 
caught by the first condition while large numbers with large absolute 
difference but small relative difference would still be considered equal by the 
subsequent conditions).

 almostEquals mllib.util.TestingUtils does not behave as expected when 
 comparing against 0.0
 ---

 Key: SPARK-2599
 URL: https://issues.apache.org/jira/browse/SPARK-2599
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Doris Xin
Priority: Minor

 DoubleWithAlmostEquals.almostEquals, when used to compare a number with 0.0, 
 would always produce an epsilon of 1 > 1e-10, causing false failure when 
 comparing very small numbers with 0.0.





[jira] [Issue Comment Deleted] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-07-08 Thread Doris Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doris Xin updated SPARK-2308:
-

Comment: was deleted

(was: Hey guys,

Sorry to crash the party. I don't think small clusters are actually a problem 
since you're using a fixed sample size instead of a sampling rate. So for small 
clusters whose sizes are comparable to the batchSize, you'd have a sampling 
rate ~1.0, which means the entire cluster is picked up in the sample. 

Alternatively, you can look into congressional sampling: 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.100.1057&rep=rep1&type=pdf,
 where there's both a fixed size portion and a portion that's proportional to 
the cluster size in each sample.)

 Add KMeans MiniBatch clustering algorithm to MLlib
 --

 Key: SPARK-2308
 URL: https://issues.apache.org/jira/browse/SPARK-2308
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Priority: Minor

 Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
 data points in each iteration instead of the full set of data points, 
 improving performance (and in some cases, accuracy).  The mini-batch version 
 is compatible with the KMeans|| initialization algorithm currently 
 implemented in MLlib.
 I suggest adding KMeans Mini-batch as an alternative.
 I'd like this to be assigned to me.





[jira] [Updated] (SPARK-2359) Supporting common statistical functions in MLlib

2014-07-03 Thread Doris Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doris Xin updated SPARK-2359:
-

Summary: Supporting common statistical functions in MLlib  (was: Supporting 
common statistical estimators in MLlib)

 Supporting common statistical functions in MLlib
 

 Key: SPARK-2359
 URL: https://issues.apache.org/jira/browse/SPARK-2359
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Reynold Xin
Assignee: Doris Xin

 This was originally proposed by [~falaki].
 This is a proposal for a new package within the Spark distribution to support 
 common statistical estimators. We think consolidating statistics-related 
 functions in a separate package will improve the readability of the core source 
 code and encourage Spark users to contribute their functions back.
 Please see the initial design document here: 
 https://docs.google.com/document/d/1Kju9kWSYMXMjEO6ggC9bF9eNbaM4MxcFs_KDqgAcH9c/pub





[jira] [Updated] (SPARK-2182) Scalastyle rule blocking unicode operators

2014-06-18 Thread Doris Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doris Xin updated SPARK-2182:
-

Attachment: Screen Shot 2014-06-18 at 3.28.44 PM.png

How I spotted it in Eclipse

 Scalastyle rule blocking unicode operators
 --

 Key: SPARK-2182
 URL: https://issues.apache.org/jira/browse/SPARK-2182
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Andrew Ash
 Attachments: Screen Shot 2014-06-18 at 3.28.44 PM.png


 Some IDEs don't support Scala's [unicode 
 operators|http://www.scala-lang.org/old/node/4723] so we should consider 
 adding a scalastyle rule to block them for wider compatibility among 
 contributors.
 See this PR for a place we reverted a unicode operator: 
 https://github.com/apache/spark/pull/1119





[jira] [Created] (SPARK-2145) Add lower bound on sampling rate to guarantee sampling performance

2014-06-14 Thread Doris Xin (JIRA)
Doris Xin created SPARK-2145:


 Summary: Add lower bound on sampling rate to guarantee sampling 
performance
 Key: SPARK-2145
 URL: https://issues.apache.org/jira/browse/SPARK-2145
 Project: Spark
  Issue Type: Improvement
Reporter: Doris Xin
Priority: Minor


For extremely small sampling rates p < 10^-10, we want to prevent resampling 
caused by the RNG not returning completely random numbers. 





[jira] [Created] (SPARK-2082) Stratified sampling implementation in PairRDDFunctions

2014-06-09 Thread Doris Xin (JIRA)
Doris Xin created SPARK-2082:


 Summary: Stratified sampling implementation in PairRDDFunctions
 Key: SPARK-2082
 URL: https://issues.apache.org/jira/browse/SPARK-2082
 Project: Spark
  Issue Type: New Feature
Reporter: Doris Xin


Implementation of stratified sampling that guarantees exact sample size = 
sum(math.ceil(S_i*samplingRate)) where S_i is the size of each stratum. 
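
The size guarantee can be illustrated with an in-memory sketch. This is not 
the distributed implementation in PairRDDFunctions: exact_stratified_sample 
is a hypothetical name, the data is assumed to fit in memory, and Python's 
random.sample stands in for whatever sampler the real code uses. Each stratum 
contributes exactly ceil(S_i * samplingRate) elements.

```python
import math
import random
from collections import defaultdict


def exact_stratified_sample(pairs, sampling_rate, seed=0):
    """Sample ceil(S_i * sampling_rate) values without replacement per key."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for key, value in pairs:
        strata[key].append(value)
    sample = []
    for key in sorted(strata):
        values = strata[key]
        size = int(math.ceil(len(values) * sampling_rate))
        sample.extend((key, v) for v in rng.sample(values, size))
    return sample


# Strata of sizes 10, 25 and 3; a rate of 0.25 gives 3 + 7 + 1 = 11 elements.
pairs = ([("a", i) for i in range(10)]
         + [("b", i) for i in range(25)]
         + [("c", i) for i in range(3)])
sample = exact_stratified_sample(pairs, 0.25)
print(len(sample))
```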





[jira] [Created] (SPARK-2088) NPE in toString when creationSiteInfo is null after deserialization

2014-06-09 Thread Doris Xin (JIRA)
Doris Xin created SPARK-2088:


 Summary: NPE in toString when creationSiteInfo is null after 
deserialization
 Key: SPARK-2088
 URL: https://issues.apache.org/jira/browse/SPARK-2088
 Project: Spark
  Issue Type: Bug
Reporter: Doris Xin


After deserialization, the transient field creationSiteInfo does not get 
backfilled with the default value, but the toString method, which is invoked by 
the serializer, expects the field to always be non-null. The following issue is 
encountered during serialization:

java.lang.NullPointerException
    at org.apache.spark.rdd.RDD.getCreationSite(RDD.scala:1198)
    at org.apache.spark.rdd.RDD.toString(RDD.scala:1263)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1418)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at scala.collection.immutable.$colon$colon.writeObject(List.scala:379)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
    at org.apache.spark.scheduler.ResultTask$.serializeInfo(ResultTask.scala:46)
    at org.apache.spark.scheduler.ResultTask.writeExternal(ResultTask.scala:125)
    at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1458)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
    at 

[jira] [Updated] (SPARK-1939) Refactor takeSample method in RDD to use ScaSRS

2014-05-29 Thread Doris Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doris Xin updated SPARK-1939:
-

Summary: Refactor takeSample method in RDD to use ScaSRS  (was: Improve 
takeSample method in RDD)

 Refactor takeSample method in RDD to use ScaSRS
 ---

 Key: SPARK-1939
 URL: https://issues.apache.org/jira/browse/SPARK-1939
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Doris Xin
Assignee: Doris Xin
  Labels: newbie

 Reimplement takeSample with the ScaSRS algorithm.





[jira] [Created] (SPARK-1939) Improve takeSample method in RDD

2014-05-27 Thread Doris Xin (JIRA)
Doris Xin created SPARK-1939:


 Summary: Improve takeSample method in RDD
 Key: SPARK-1939
 URL: https://issues.apache.org/jira/browse/SPARK-1939
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Doris Xin


Reimplement takeSample with the ScaSRS algorithm.


