[jira] [Commented] (SPARK-4348) pyspark.mllib.random conflicts with random module
[ https://issues.apache.org/jira/browse/SPARK-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207199#comment-14207199 ] Doris Xin commented on SPARK-4348: -- I fully support this. It took a lot of hacking just to override the default random module in Python, and it wasn't clear that the override was the ideal solution. pyspark.mllib.random conflicts with random module - Key: SPARK-4348 URL: https://issues.apache.org/jira/browse/SPARK-4348 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.1.0, 1.2.0 Reporter: Davies Liu Priority: Blocker There are conflicts in two cases: 1. The random module is used by pyspark.mllib.feature; if the first entry of sys.path is not '', then the hack in pyspark/__init__.py fails to fix the conflict. 2. When running tests in mllib/xxx.py, the '' entry should be popped off sys.path before importing anything, or the tests fail. The first case is not fully fixed for users; it still causes problems in scenarios such as:
{code}
import sys
sys.path.insert(0, PATH_OF_MODULE)
import pyspark
# using Word2Vec will fail
{code}
I'd like to rename mllib/random.py to mllib/_random.py, then in mllib/__init__.py:
{code}
import pyspark.mllib._random as random
{code}
cc [~mengxr] [~dorx] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3077) ChiSqTest bugs
Doris Xin created SPARK-3077: Summary: ChiSqTest bugs Key: SPARK-3077 URL: https://issues.apache.org/jira/browse/SPARK-3077 Project: Spark Issue Type: Bug Components: MLlib Reporter: Doris Xin - promote nullHypothesis field in ChiSqTestResult to TestResult. Every test should have a null hypothesis - Correct null hypothesis statement for independence test - line 59 in TestResult: 0.05 -> 0.5 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2993) colStats in Statistics as wrapper around MultivariateStatisticalSummary in Scala and Python
Doris Xin created SPARK-2993: Summary: colStats in Statistics as wrapper around MultivariateStatisticalSummary in Scala and Python Key: SPARK-2993 URL: https://issues.apache.org/jira/browse/SPARK-2993 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Doris Xin -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2515) Chi-squared test
[ https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doris Xin updated SPARK-2515: - Summary: Chi-squared test (was: Hypothesis testing) Chi-squared test Key: SPARK-2515 URL: https://issues.apache.org/jira/browse/SPARK-2515 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Doris Xin Fix For: 1.1.0 Support common statistical tests in Spark MLlib. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2980) Python support for chi-squared test
Doris Xin created SPARK-2980: Summary: Python support for chi-squared test Key: SPARK-2980 URL: https://issues.apache.org/jira/browse/SPARK-2980 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Doris Xin -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2937) Separate out sampleByKeyExact in PairRDDFunctions as its own API
Doris Xin created SPARK-2937: Summary: Separate out sampleByKeyExact in PairRDDFunctions as its own API Key: SPARK-2937 URL: https://issues.apache.org/jira/browse/SPARK-2937 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Doris Xin -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2851) Check API consistency for decision tree
[ https://issues.apache.org/jira/browse/SPARK-2851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doris Xin resolved SPARK-2851. -- Resolution: Done Check API consistency for decision tree --- Key: SPARK-2851 URL: https://issues.apache.org/jira/browse/SPARK-2851 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Doris Xin Fix For: 1.1.0 Decision tree API consistency across Python/Java/Scala. We might want to add a new constructor to Scala's decision tree to match Python's. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-2851) Check API consistency for decision tree
[ https://issues.apache.org/jira/browse/SPARK-2851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doris Xin reopened SPARK-2851: -- Shouldn't have been auto-closed with the PR. Check API consistency for decision tree --- Key: SPARK-2851 URL: https://issues.apache.org/jira/browse/SPARK-2851 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Doris Xin Fix For: 1.1.0 Decision tree API consistency across Python/Java/Scala. We might want to add a new constructor to Scala's decision tree to match Python's. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2786) Python correlations
Doris Xin created SPARK-2786: Summary: Python correlations Key: SPARK-2786 URL: https://issues.apache.org/jira/browse/SPARK-2786 Project: Spark Issue Type: Sub-task Reporter: Doris Xin -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2782) Spearman correlation computes wrong ranks when numPartitions > RDD size
Doris Xin created SPARK-2782: Summary: Spearman correlation computes wrong ranks when numPartitions > RDD size Key: SPARK-2782 URL: https://issues.apache.org/jira/browse/SPARK-2782 Project: Spark Issue Type: Bug Reporter: Doris Xin The getRanks logic inside of SpearmanCorrelation returns the wrong ranks when numPartitions > size for the input RDDs. -- This message was sent by Atlassian JIRA (v6.2#6252)
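For reference, Spearman correlation is just Pearson correlation applied to the rank vectors, which is why a bug in the rank computation silently corrupts the result. A minimal single-machine sketch (hypothetical helper names, not the MLlib SpearmanCorrelation implementation):

```python
def ranks(xs):
    """1-based ranks; ties receive the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of equal values (a tie group)
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average position, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rho: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

The distributed version has to assign these ranks across partitions, which is where empty partitions (numPartitions > RDD size) can throw the bookkeeping off.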
[jira] [Reopened] (SPARK-2512) Stratified sampling
[ https://issues.apache.org/jira/browse/SPARK-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doris Xin reopened SPARK-2512: -- Stratified sampling --- Key: SPARK-2512 URL: https://issues.apache.org/jira/browse/SPARK-2512 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Doris Xin PR: https://github.com/apache/spark/pull/1025 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2724) Python version of Random RDD without support for arbitrary distribution
Doris Xin created SPARK-2724: Summary: Python version of Random RDD without support for arbitrary distribution Key: SPARK-2724 URL: https://issues.apache.org/jira/browse/SPARK-2724 Project: Spark Issue Type: Sub-task Reporter: Doris Xin Python version of [SPARK-2514] but without support for randomRDD and randomVectorRDD, which take in any DistributionGenerator objects. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2515) Hypothesis testing
[ https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074802#comment-14074802 ] Doris Xin commented on SPARK-2515: -- A toString method sounds like a really good idea here actually. I think originally we planned the Summary object to hold anything that isn't standard across tests, and in the case of chi-squared, I can't think of anything else to put in there. Having the toString method instead would allow us to have a single TestResult class across tests, too. Sure, we can go with degreesOfFreedom. Hypothesis testing -- Key: SPARK-2515 URL: https://issues.apache.org/jira/browse/SPARK-2515 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Doris Xin Support common statistical tests in Spark MLlib. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2679) Ser/De for Double to enable calling Java API from python in MLlib
Doris Xin created SPARK-2679: Summary: Ser/De for Double to enable calling Java API from python in MLlib Key: SPARK-2679 URL: https://issues.apache.org/jira/browse/SPARK-2679 Project: Spark Issue Type: Sub-task Reporter: Doris Xin In order to enable Java/Scala APIs to be reused in the Python implementation of RandomRDD and Correlations, we need a set of ser/de for the type Double in _common.py and PythonMLLibAPI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2515) Hypothesis testing
[ https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073879#comment-14073879 ] Doris Xin commented on SPARK-2515: -- Here's the proposed API for chi-squared tests (lives in org.apache.spark.mllib.stat.Statistics):
{code}
def chiSquare(X: RDD[Vector], method: String = "pearson"): ChiSquareTestResult
def chiSquare(x: RDD[Double], y: RDD[Double], method: String = "pearson"): ChiSquareTestResult
{code}
where ChiSquareTestResult <: TestResult looks like:
{code}
pValue: Double
df: Array[Int] // normally a single value but needs to be more for ANOVA
statistic: Double
ChiSquareSummary <: Summary
{code}
So a couple points of discussion: 1. Of the many variants of the chi-squared test, which methods in addition to pearson do we want to support (hopefully based on popular demand)? http://en.wikipedia.org/wiki/Chi-squared_test 2. What special fields should ChiSquareSummary have? Hypothesis testing -- Key: SPARK-2515 URL: https://issues.apache.org/jira/browse/SPARK-2515 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Doris Xin Support common statistical tests in Spark MLlib. -- This message was sent by Atlassian JIRA (v6.2#6252)
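For concreteness, the Pearson statistic the proposed chiSquare methods would report is the sum over categories of (O - E)^2 / E. A minimal single-machine sketch in Python (illustrative only, not the MLlib implementation):

```python
from math import fsum

def chi_square_stat(observed, expected):
    """Pearson's chi-squared statistic: sum((O - E)^2 / E) over categories."""
    return fsum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Classic fair-die goodness-of-fit example: 88 rolls, uniform expectation.
observed = [16, 18, 16, 14, 12, 12]
expected = [sum(observed) / 6.0] * 6
stat = chi_square_stat(observed, expected)  # 2.0
df = len(observed) - 1                      # degrees of freedom = 5
```

The p-value then comes from the chi-squared distribution with df degrees of freedom, which is what the pValue field in the proposed TestResult would carry.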
[jira] [Created] (SPARK-2656) Python version without support for exact sample size
Doris Xin created SPARK-2656: Summary: Python version without support for exact sample size Key: SPARK-2656 URL: https://issues.apache.org/jira/browse/SPARK-2656 Project: Spark Issue Type: Sub-task Reporter: Doris Xin -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2599) almostEquals mllib.util.TestingUtils does not behave as expected when comparing against 0.0
[ https://issues.apache.org/jira/browse/SPARK-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doris Xin closed SPARK-2599. Resolution: Duplicate Refer to this issue: https://issues.apache.org/jira/browse/SPARK-2479 almostEquals mllib.util.TestingUtils does not behave as expected when comparing against 0.0 --- Key: SPARK-2599 URL: https://issues.apache.org/jira/browse/SPARK-2599 Project: Spark Issue Type: Bug Components: MLlib Reporter: Doris Xin Priority: Minor DoubleWithAlmostEquals.almostEquals, when used to compare a number with 0.0, would always produce an epsilon of 1 > 1e-10, causing false failures when comparing very small numbers with 0.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2599) almostEquals mllib.util.TestingUtils does not behave as expected when comparing against 0.0
Doris Xin created SPARK-2599: Summary: almostEquals mllib.util.TestingUtils does not behave as expected when comparing against 0.0 Key: SPARK-2599 URL: https://issues.apache.org/jira/browse/SPARK-2599 Project: Spark Issue Type: Bug Components: MLlib Reporter: Doris Xin Priority: Minor DoubleWithAlmostEquals.almostEquals, when used to compare a number with 0.0, would always produce an epsilon of 1 > 1e-10, causing false failures when comparing very small numbers with 0.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2512) Stratified sampling
[ https://issues.apache.org/jira/browse/SPARK-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067848#comment-14067848 ] Doris Xin commented on SPARK-2512: -- Hey Xiangrui, can you close this one since there's already another JIRA in place for this? https://issues.apache.org/jira/browse/SPARK-2082 Thanks. Stratified sampling --- Key: SPARK-2512 URL: https://issues.apache.org/jira/browse/SPARK-2512 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Doris Xin PR: https://github.com/apache/spark/pull/1025 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2600) Correlations (Pearson, Spearman)
[ https://issues.apache.org/jira/browse/SPARK-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doris Xin closed SPARK-2600. Resolution: Implemented Correlations (Pearson, Spearman) Key: SPARK-2600 URL: https://issues.apache.org/jira/browse/SPARK-2600 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Doris Xin Assignee: Doris Xin -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2082) Stratified sampling implementation in PairRDDFunctions
[ https://issues.apache.org/jira/browse/SPARK-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doris Xin updated SPARK-2082: - Target Version/s: 1.1.0 Stratified sampling implementation in PairRDDFunctions -- Key: SPARK-2082 URL: https://issues.apache.org/jira/browse/SPARK-2082 Project: Spark Issue Type: New Feature Reporter: Doris Xin Assignee: Doris Xin Implementation of stratified sampling that guarantees exact sample size = sum(math.ceil(S_i*samplingRate)) where S_i is the size of each stratum. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2512) Stratified sampling
[ https://issues.apache.org/jira/browse/SPARK-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doris Xin closed SPARK-2512. Resolution: Duplicate Stratified sampling --- Key: SPARK-2512 URL: https://issues.apache.org/jira/browse/SPARK-2512 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Doris Xin PR: https://github.com/apache/spark/pull/1025 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2599) almostEquals mllib.util.TestingUtils does not behave as expected when comparing against 0.0
[ https://issues.apache.org/jira/browse/SPARK-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068113#comment-14068113 ] Doris Xin commented on SPARK-2599: -- Found this in-depth article discussing the different considerations for comparing floating point numbers: http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm My suggestion is the following (a blend of absolute and relative epsilon):
{code}
def almostEquals(y: Double, epsilon: Double = 1E-10): Boolean = {
  if (x == y || math.abs(x - y) < epsilon) {
    true
  } else if (math.abs(x) > math.abs(y)) {
    math.abs(x - y) / math.abs(x) < epsilon
  } else {
    math.abs(x - y) / math.abs(y) < epsilon
  }
}
{code}
Not the most rigorous but covers most use cases I'd imagine (small numbers get caught by the first condition while large numbers with a large absolute difference but small relative difference would still be considered equal by the subsequent conditions). almostEquals mllib.util.TestingUtils does not behave as expected when comparing against 0.0 --- Key: SPARK-2599 URL: https://issues.apache.org/jira/browse/SPARK-2599 Project: Spark Issue Type: Bug Components: MLlib Reporter: Doris Xin Priority: Minor DoubleWithAlmostEquals.almostEquals, when used to compare a number with 0.0, would always produce an epsilon of 1 > 1e-10, causing false failures when comparing very small numbers with 0.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
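The suggested blend of absolute and relative epsilon translates directly to Python, which makes it easy to check the behaviors the comment describes (near-zero values pass the absolute check; large values are compared relatively). A sketch of the same logic, not the TestingUtils code:

```python
def almost_equals(x: float, y: float, epsilon: float = 1e-10) -> bool:
    """Blend of absolute and relative epsilon comparison."""
    if x == y or abs(x - y) < epsilon:
        return True  # absolute check: catches values very close to 0.0
    if abs(x) > abs(y):
        return abs(x - y) / abs(x) < epsilon  # relative to the larger magnitude
    return abs(x - y) / abs(y) < epsilon
```

With the original purely relative scheme, comparing 1e-12 against 0.0 yields a relative error of 1, which always fails; the absolute check above is what fixes that case.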
[jira] [Comment Edited] (SPARK-2599) almostEquals mllib.util.TestingUtils does not behave as expected when comparing against 0.0
[ https://issues.apache.org/jira/browse/SPARK-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068113#comment-14068113 ] Doris Xin edited comment on SPARK-2599 at 7/21/14 2:06 AM: --- Found this in-depth article discussing the different considerations for comparing floating point numbers: http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm My suggestion is the following (a blend of absolute and relative epsilon):
{code}
def almostEquals(y: Double, epsilon: Double = 1E-10): Boolean = {
  if (x == y || math.abs(x - y) < epsilon) {
    true
  } else if (math.abs(x) > math.abs(y)) {
    math.abs(x - y) / math.abs(x) < epsilon
  } else {
    math.abs(x - y) / math.abs(y) < epsilon
  }
}
{code}
Not the most rigorous but covers most use cases I'd imagine (small numbers get caught by the first condition while large numbers with a large absolute difference but small relative difference would still be considered equal by the subsequent conditions). was (Author: dorx): Found this in-depth article discussing the different considerations for comparing floating point numbers: http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm My suggestion is the following (a blend of absolute and relative epsilon): def almostEquals(y: Double, epsilon: Double = 1E-10): Boolean = { if (x == y || math.abs(x - y) < epsilon) { true } else if (math.abs(x) > math.abs(y)) { math.abs(x - y) / math.abs(x) < epsilon } else { math.abs(x - y) / math.abs(y) < epsilon } } Not the most rigorous but covers most use cases I'd imagine (small numbers get caught by the first condition while large numbers with a large absolute difference but small relative difference would still be considered equal by the subsequent conditions). 
almostEquals mllib.util.TestingUtils does not behave as expected when comparing against 0.0 --- Key: SPARK-2599 URL: https://issues.apache.org/jira/browse/SPARK-2599 Project: Spark Issue Type: Bug Components: MLlib Reporter: Doris Xin Priority: Minor DoubleWithAlmostEquals.almostEquals, when used to compare a number with 0.0, would always produce an epsilon of 1 > 1e-10, causing false failures when comparing very small numbers with 0.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doris Xin updated SPARK-2308: - Comment: was deleted (was: Hey guys, Sorry to crash the party. I don't think small clusters are actually a problem since you're using a fixed sample size instead of a sampling rate. So for small clusters whose sizes are comparable to the batchSize, you'd have a sampling rate ~1.0, which means the entire cluster is picked up in the sample. Alternatively, you can look into congressional sampling: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.100.1057&rep=rep1&type=pdf, where there's both a fixed size portion and a portion that's proportional to the cluster size in each sample.) Add KMeans MiniBatch clustering algorithm to MLlib -- Key: SPARK-2308 URL: https://issues.apache.org/jira/browse/SPARK-2308 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Priority: Minor Mini-batch is a version of KMeans that uses a randomly-sampled subset of the data points in each iteration instead of the full set of data points, improving performance (and in some cases, accuracy). The mini-batch version is compatible with the KMeans|| initialization algorithm currently implemented in MLlib. I suggest adding KMeans Mini-batch as an alternative. I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.2#6252)
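The observation in the deleted comment, that a fixed batch size implies a per-cluster sampling rate of roughly min(1, batchSize / clusterSize), can be sketched in a few lines (hypothetical helper names, purely illustrative):

```python
def per_cluster_rates(cluster_sizes, batch_size):
    """Effective sampling rate per cluster when drawing a fixed-size batch:
    capped at 1.0, so small clusters are picked up in full."""
    return {k: min(1.0, batch_size / float(s)) for k, s in cluster_sizes.items()}

rates = per_cluster_rates({"large": 1000, "small": 80}, 100)
# the large cluster is sampled at rate 0.1, while the small cluster's
# rate is capped at 1.0, i.e. the whole cluster lands in the sample
```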
[jira] [Updated] (SPARK-2359) Supporting common statistical functions in MLlib
[ https://issues.apache.org/jira/browse/SPARK-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doris Xin updated SPARK-2359: - Summary: Supporting common statistical functions in MLlib (was: Supporting common statistical estimators in MLlib) Supporting common statistical functions in MLlib Key: SPARK-2359 URL: https://issues.apache.org/jira/browse/SPARK-2359 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Reynold Xin Assignee: Doris Xin This is originally proposed by [~falaki]. This is a proposal for a new package within the Spark distribution to support common statistical estimators. We think consolidating statistical related functions in a separate package will help with readability of core source code and encourage spark users to submit back their functions. Please see the initial design document here: https://docs.google.com/document/d/1Kju9kWSYMXMjEO6ggC9bF9eNbaM4MxcFs_KDqgAcH9c/pub -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2182) Scalastyle rule blocking unicode operators
[ https://issues.apache.org/jira/browse/SPARK-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doris Xin updated SPARK-2182: - Attachment: Screen Shot 2014-06-18 at 3.28.44 PM.png How I spotted it in Eclipse Scalastyle rule blocking unicode operators -- Key: SPARK-2182 URL: https://issues.apache.org/jira/browse/SPARK-2182 Project: Spark Issue Type: Bug Components: Build Reporter: Andrew Ash Attachments: Screen Shot 2014-06-18 at 3.28.44 PM.png Some IDEs don't support Scala's [unicode operators|http://www.scala-lang.org/old/node/4723] so we should consider adding a scalastyle rule to block them for wider compatibility among contributors. See this PR for a place we reverted a unicode operator: https://github.com/apache/spark/pull/1119 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2145) Add lower bound on sampling rate to guarantee sampling performance
Doris Xin created SPARK-2145: Summary: Add lower bound on sampling rate to guarantee sampling performance Key: SPARK-2145 URL: https://issues.apache.org/jira/browse/SPARK-2145 Project: Spark Issue Type: Improvement Reporter: Doris Xin Priority: Minor For extremely small sampling rates p < 10^-10, we want to prevent resampling caused by the RNG not returning completely random numbers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2082) Stratified sampling implementation in PairRDDFunctions
Doris Xin created SPARK-2082: Summary: Stratified sampling implementation in PairRDDFunctions Key: SPARK-2082 URL: https://issues.apache.org/jira/browse/SPARK-2082 Project: Spark Issue Type: New Feature Reporter: Doris Xin Implementation of stratified sampling that guarantees exact sample size = sum(math.ceil(S_i*samplingRate)) where S_i is the size of each stratum. -- This message was sent by Atlassian JIRA (v6.2#6252)
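The exact-size guarantee sum(math.ceil(S_i*samplingRate)) can be computed per stratum as follows (a hypothetical helper to illustrate the formula, not the PairRDDFunctions implementation):

```python
import math

def exact_stratum_sample_sizes(stratum_sizes, sampling_rate):
    """Per-stratum sample size under the guarantee ceil(S_i * samplingRate)."""
    return {key: int(math.ceil(size * sampling_rate))
            for key, size in stratum_sizes.items()}

sizes = exact_stratum_sample_sizes({"a": 10, "b": 5, "c": 1}, 0.3)
# total exact sample size is the sum over strata
total = sum(sizes.values())
```

Note the ceiling means every non-empty stratum contributes at least one element, so the exact total can exceed samplingRate times the RDD size when there are many small strata.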
[jira] [Created] (SPARK-2088) NPE in toString when creationSiteInfo is null after deserialization
Doris Xin created SPARK-2088: Summary: NPE in toString when creationSiteInfo is null after deserialization Key: SPARK-2088 URL: https://issues.apache.org/jira/browse/SPARK-2088 Project: Spark Issue Type: Bug Reporter: Doris Xin After deserialization, the transient field creationSiteInfo does not get backfilled with the default value, but the toString method, which is invoked by the serializer, expects the field to always be non-null. The following exception is encountered during serialization:
{code}
java.lang.NullPointerException
	at org.apache.spark.rdd.RDD.getCreationSite(RDD.scala:1198)
	at org.apache.spark.rdd.RDD.toString(RDD.scala:1263)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1418)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
	at scala.collection.immutable.$colon$colon.writeObject(List.scala:379)
	at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
	at org.apache.spark.scheduler.ResultTask$.serializeInfo(ResultTask.scala:46)
	at org.apache.spark.scheduler.ResultTask.writeExternal(ResultTask.scala:125)
	at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1458)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
	at
{code}
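The failure mode, a field excluded from serialization coming back unset and then being dereferenced by toString, is easy to reproduce in miniature. A Python analog using pickle (the class and field names are hypothetical; Java's transient fields and Python's __getstate__ behave similarly in that neither restores the dropped field):

```python
import pickle

class Rdd:
    def __init__(self):
        self.creation_site = "makeRDD at Demo.scala:1"

    def __getstate__(self):
        # Mimic a transient field: drop it from the serialized form.
        state = self.__dict__.copy()
        state.pop("creation_site", None)
        return state

    def __str__(self):
        # A buggy version would assume the field is always present:
        #   return "RDD at " + self.creation_site   # AttributeError after load
        # The defensive version, analogous to a null check in getCreationSite:
        return "RDD at " + getattr(self, "creation_site", "<unknown>")

restored = pickle.loads(pickle.dumps(Rdd()))
print(str(restored))  # -> RDD at <unknown>
```

The fix in either language is the same: the accessor must tolerate the missing value instead of assuming deserialization restores it.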
[jira] [Updated] (SPARK-1939) Refactor takeSample method in RDD to use ScaSRS
[ https://issues.apache.org/jira/browse/SPARK-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doris Xin updated SPARK-1939: - Summary: Refactor takeSample method in RDD to use ScaSRS (was: Improve takeSample method in RDD) Refactor takeSample method in RDD to use ScaSRS --- Key: SPARK-1939 URL: https://issues.apache.org/jira/browse/SPARK-1939 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Doris Xin Assignee: Doris Xin Labels: newbie reimplement takeSample with the ScaSRS algorithm -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1939) Improve takeSample method in RDD
Doris Xin created SPARK-1939: Summary: Improve takeSample method in RDD Key: SPARK-1939 URL: https://issues.apache.org/jira/browse/SPARK-1939 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Doris Xin reimplement takeSample with the ScaSRS algorithm -- This message was sent by Atlassian JIRA (v6.2#6252)