[jira] [Updated] (SPARK-20771) Usability issues with weekofyear()
[ https://issues.apache.org/jira/browse/SPARK-20771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-20771:
----------------------------------
    Description:

The weekofyear() implementation follows the Hive / ISO 8601 week number. However, it is not very useful because it does not return the year the week belongs to. For example, weekofyear("2017-01-01") returns 52. Anyone using this with groupBy('week) might get the aggregation or ordering wrong. A better implementation should return the year number of the week as well. MySQL's yearweek() is much better in this sense: https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_yearweek. Maybe we should implement that in Spark.

  was: (the same text, reading "might do the aggregation wrong" rather than "might do the aggregation or ordering wrong")

> Usability issues with weekofyear()
> ----------------------------------
>
>                 Key: SPARK-20771
>                 URL: https://issues.apache.org/jira/browse/SPARK-20771
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Xiangrui Meng
>            Priority: Minor
>
> (Issue description as above.)

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
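As a stand-alone illustration of the pitfall (plain Python, not Spark code): `datetime.isocalendar()` follows the same ISO 8601 convention, and shows why the week number alone is ambiguous at year boundaries. Pairing it with the ISO year, in the spirit of MySQL's yearweek(), yields an unambiguous grouping key. The helper name below is hypothetical.

```python
from datetime import date

def iso_yearweek(d: date) -> int:
    # Combine ISO year and ISO week into one sortable key,
    # mirroring MySQL's yearweek() (year * 100 + week).
    iso_year, iso_week, _ = d.isocalendar()
    return iso_year * 100 + iso_week

# 2017-01-01 falls in ISO week 52 *of 2016*; the bare week number
# would group it with dates from late December 2016.
print(date(2017, 1, 1).isocalendar()[:2])  # -> (2016, 52)
print(iso_yearweek(date(2017, 1, 1)))      # -> 201652
print(iso_yearweek(date(2016, 12, 31)))    # -> 201652 (same group, correctly)
print(iso_yearweek(date(2017, 1, 2)))      # -> 201701
```

Grouping by the combined key keeps week 52 of 2016 and week 52 of 2017 apart, which groupBy on the week number alone cannot do.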
[jira] [Created] (SPARK-20771) Usability issues with weekofyear()
Xiangrui Meng created SPARK-20771:
-------------------------------------

             Summary: Usability issues with weekofyear()
                 Key: SPARK-20771
                 URL: https://issues.apache.org/jira/browse/SPARK-20771
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Xiangrui Meng
            Priority: Minor

(Original description as filed; the current description above adds "or ordering".)
[jira] [Created] (SPARK-20129) JavaSparkContext should use SparkContext.getOrCreate
Xiangrui Meng created SPARK-20129:
-------------------------------------

             Summary: JavaSparkContext should use SparkContext.getOrCreate
                 Key: SPARK-20129
                 URL: https://issues.apache.org/jira/browse/SPARK-20129
             Project: Spark
          Issue Type: Improvement
          Components: Java API
    Affects Versions: 2.1.0
            Reporter: Xiangrui Meng

It should re-use an existing SparkContext if there is a live one.
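For readers unfamiliar with the pattern, here is a hedged, stand-alone sketch (plain Python, not Spark's actual implementation) of what a getOrCreate-style constructor does: reuse the live singleton if one exists, otherwise create it. The `Context` class is a toy stand-in.

```python
import threading

class Context:
    """Toy stand-in for SparkContext; illustrative only."""
    _active = None            # the currently live context, if any
    _lock = threading.Lock()  # guard creation against races

    def __init__(self, app_name: str):
        self.app_name = app_name

    @classmethod
    def get_or_create(cls, app_name: str = "default") -> "Context":
        # Reuse the existing live context instead of failing or
        # constructing a second one -- the behavior this ticket asks
        # JavaSparkContext to delegate to SparkContext.getOrCreate.
        with cls._lock:
            if cls._active is None:
                cls._active = cls(app_name)
            return cls._active

a = Context.get_or_create("first")
b = Context.get_or_create("second")  # returns the same live instance
print(a is b)                        # -> True
print(b.app_name)                    # -> first
```

Note that arguments passed to a later call are ignored once a context is live, which matches the "re-use an existing SparkContext" behavior requested above.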
[jira] [Assigned] (SPARK-20129) JavaSparkContext should use SparkContext.getOrCreate
[ https://issues.apache.org/jira/browse/SPARK-20129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng reassigned SPARK-20129:
-------------------------------------

    Assignee: Xiangrui Meng

> JavaSparkContext should use SparkContext.getOrCreate
> ----------------------------------------------------
>
>                 Key: SPARK-20129
>                 URL: https://issues.apache.org/jira/browse/SPARK-20129
>             Project: Spark
>          Issue Type: Improvement
>          Components: Java API
>    Affects Versions: 2.1.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>
> It should re-use an existing SparkContext if there is a live one.
[jira] [Assigned] (SPARK-20088) Do not create new SparkContext in SparkR createSparkContext
[ https://issues.apache.org/jira/browse/SPARK-20088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng reassigned SPARK-20088:
-------------------------------------

    Assignee: Hossein Falaki

> Do not create new SparkContext in SparkR createSparkContext
> -----------------------------------------------------------
>
>                 Key: SPARK-20088
>                 URL: https://issues.apache.org/jira/browse/SPARK-20088
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.2.0
>            Reporter: Hossein Falaki
>            Assignee: Hossein Falaki
>             Fix For: 2.2.0
>
> In the implementation of {{createSparkContext}}, we are calling
> {code}
> new JavaSparkContext()
> {code}
[jira] [Resolved] (SPARK-20088) Do not create new SparkContext in SparkR createSparkContext
[ https://issues.apache.org/jira/browse/SPARK-20088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-20088.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0

Issue resolved by pull request 17423
[https://github.com/apache/spark/pull/17423]

> Do not create new SparkContext in SparkR createSparkContext
> -----------------------------------------------------------
>
> (Key, metadata, and description as in the message above.)
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880786#comment-15880786 ]

Xiangrui Meng commented on SPARK-5226:
--------------------------------------

I closed this ticket as "Won't Do" due to DBSCAN's high complexity, and hence poor scalability, as documented in http://staff.itee.uq.edu.au/taoyf/paper/sigmod15-dbscan.pdf.

> Add DBSCAN Clustering Algorithm to MLlib
> ----------------------------------------
>
>                 Key: SPARK-5226
>                 URL: https://issues.apache.org/jira/browse/SPARK-5226
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Muhammad-Ali A'rabi
>            Priority: Minor
>              Labels: DBSCAN, clustering
>
> MLlib is all k-means now, and I think we should add some new clustering algorithms to it. The first candidate, I think, is DBSCAN.
[jira] [Closed] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng closed SPARK-5226.
--------------------------------
    Resolution: Won't Fix

> Add DBSCAN Clustering Algorithm to MLlib
> ----------------------------------------
>
> (Key, metadata, and description as in the message above.)
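As context for the scalability concern raised in the comment above, here is a minimal textbook DBSCAN sketch (plain Python, not a proposed MLlib implementation): each point triggers a neighborhood query, and without a spatial index every query scans all n points, giving the O(n^2) behavior the linked paper analyzes.

```python
from collections import deque

def dbscan(points, eps, min_pts):
    """Naive DBSCAN; every region query scans all n points (O(n^2) total)."""
    n = len(points)
    UNVISITED, NOISE = -2, -1
    labels = [UNVISITED] * n

    def region_query(i):
        # Linear scan over all points -- the scalability bottleneck.
        px, py = points[i]
        return [j for j, (qx, qy) in enumerate(points)
                if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

    cluster = 0
    for i in range(n):
        if labels[i] != UNVISITED:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster          # border point joins the cluster
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=3))  # -> [0, 0, 0, 1, 1, 1, -1]
```

Spatial indexes reduce the per-query cost, but the paper cited above shows even indexed DBSCAN degrades in higher dimensions, which motivates the "Won't Do" resolution.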
[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858963#comment-15858963 ]

Xiangrui Meng commented on SPARK-18924:
---------------------------------------

I'm going to work on this one, so I removed myself as "Shepherd".

> Improve collect/createDataFrame performance in SparkR
> -----------------------------------------------------
>
>                 Key: SPARK-18924
>                 URL: https://issues.apache.org/jira/browse/SPARK-18924
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> SparkR has its own SerDe for data serialization between the JVM and R.
> The SerDe on the JVM side is implemented in:
> * [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
> * [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]
> The SerDe on the R side is implemented in:
> * [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
> * [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]
>
> The serialization between the JVM and R suffers from huge storage and computation overhead. For example, a short round trip of 1 million doubles surprisingly took 3 minutes on my laptop:
> {code}
> > system.time(collect(createDataFrame(data.frame(x=runif(1000000)))))
>    user  system elapsed
>  14.224   0.582 189.135
> {code}
> Collecting a medium-sized DataFrame to local and continuing with a local R workflow is a use case we should pay attention to. SparkR will never be able to cover all existing features from CRAN packages. It is also unnecessary for Spark to do so because not all features need scalability.
>
> Several factors contribute to the serialization overhead:
> 1. The SerDe on the R side is implemented using high-level R methods.
> 2. DataFrame columns are not efficiently serialized, primitive type columns in particular.
> 3. Some overhead in the serialization protocol/impl.
>
> 1) might have been discussed before because R packages like rJava existed before SparkR. I'm not sure whether we have a license issue in depending on those libraries. Another option is to switch to R's low-level C interface or Rcpp, which again might have license issues. I'm not an expert here. If we have to implement our own, there still exists much space for improvement, discussed below.
>
> 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, which collects rows to local and then constructs columns. However,
> * it ignores column types and results in boxing/unboxing overhead
> * it collects all objects to the driver and results in high GC pressure
> A relatively simple change is to implement a specialized column builder based on column types, primitive types in particular. We need to handle null/NA values properly. A simple data structure we can use is
> {code}
> val size: Int
> val nullIndexes: Array[Int]
> val notNullValues: Array[T] // specialized for primitive types
> {code}
> On the R side, we can use `readBin` and `writeBin` to read the entire vector in a single method call. The speed seems reasonable (on the order of GB/s):
> {code}
> > x <- runif(10000000) # 1e7, not 1e6
> > system.time(r <- writeBin(x, raw(0)))
>    user  system elapsed
>   0.036   0.021   0.059
> >
> > system.time(y <- readBin(r, double(), 10000000))
>    user  system elapsed
>   0.015   0.007   0.024
> {code}
> This is just a proposal that needs to be discussed and formalized. But in general, it should be feasible to obtain a 20x or more performance gain.
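The proposed {size, nullIndexes, notNullValues} column layout can be sketched in plain Python (illustrative only; the proposal itself targets Scala and R, and the class name below is hypothetical): non-null values are packed densely per type, with null positions recorded separately, so the whole column serializes in one bulk call.

```python
import struct

class DoubleColumnBuilder:
    """Sketch of the proposed {size, nullIndexes, notNullValues} layout
    for a double column; hypothetical, not Spark's actual builder."""

    def __init__(self):
        self.size = 0
        self.null_indexes = []      # positions of NA/null cells
        self.not_null_values = []   # densely packed non-null doubles

    def append(self, value):
        if value is None:
            self.null_indexes.append(self.size)
        else:
            self.not_null_values.append(float(value))
        self.size += 1

    def to_bytes(self):
        # One bulk pack per column, analogous to writeBin on the R side.
        return struct.pack(f"<{len(self.not_null_values)}d", *self.not_null_values)

    def to_list(self):
        # Reconstruct the column with nulls in place (readBin analogue).
        out, it = [None] * self.size, iter(self.not_null_values)
        nulls = set(self.null_indexes)
        for i in range(self.size):
            if i not in nulls:
                out[i] = next(it)
        return out

b = DoubleColumnBuilder()
for v in [1.0, None, 3.5, None, 2.0]:
    b.append(v)
print(b.size, b.null_indexes, b.not_null_values)  # -> 5 [1, 3] [1.0, 3.5, 2.0]
print(len(b.to_bytes()))                          # -> 24 (3 doubles * 8 bytes)
print(b.to_list())                                # -> [1.0, None, 3.5, None, 2.0]
```

Because each value stays unboxed in a primitive buffer, this avoids both the boxing overhead and the per-element protocol round trips described above.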
[jira] [Updated] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-18924:
----------------------------------
    Shepherd:   (was: Xiangrui Meng)

> Improve collect/createDataFrame performance in SparkR
> -----------------------------------------------------
>
>                 Key: SPARK-18924
>                 URL: https://issues.apache.org/jira/browse/SPARK-18924
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> (Full issue description quoted above.)
[jira] [Assigned] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng reassigned SPARK-18924:
-------------------------------------

    Assignee: Xiangrui Meng

> Improve collect/createDataFrame performance in SparkR
> -----------------------------------------------------
>
>                 Key: SPARK-18924
>                 URL: https://issues.apache.org/jira/browse/SPARK-18924
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> (Full issue description quoted above.)
[jira] [Updated] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-18924:
----------------------------------
    Description:

(The updated description, quoted in full in the messages above. Relative to the previous revision, the link labels become the file names SerDe.scala, SQLUtils.scala, deserialize.R, and serialize.R, and "null values" becomes "null/NA values", among minor wording changes.)

  was:

(The previous revision of the same description; this copy is truncated mid-sentence in the archive.)
[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760439#comment-15760439 ]

Xiangrui Meng commented on SPARK-18924:
---------------------------------------

cc: [~shivaram] [~felixcheung] [~falaki] [~yanboliang] for discussion.

> Improve collect/createDataFrame performance in SparkR
> -----------------------------------------------------
>
>                 Key: SPARK-18924
>                 URL: https://issues.apache.org/jira/browse/SPARK-18924
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Xiangrui Meng
>            Priority: Critical
>
> (Issue description as originally filed; the full, later-updated text is quoted above.)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
Xiangrui Meng created SPARK-18924:
-------------------------------------

             Summary: Improve collect/createDataFrame performance in SparkR
                 Key: SPARK-18924
                 URL: https://issues.apache.org/jira/browse/SPARK-18924
             Project: Spark
          Issue Type: Improvement
          Components: SparkR
            Reporter: Xiangrui Meng
            Priority: Critical

(Original description as filed; see the full text quoted in the messages above.)
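The readBin/writeBin measurement in the proposal has a direct Python-stdlib analogue (illustrative only, not a SparkR benchmark): serializing a large double vector in one bulk call rather than element by element is what makes GB/s-range throughput plausible.

```python
import array
import time

n = 1_000_000
xs = array.array("d", (i * 0.001 for i in range(n)))  # 1e6 doubles

t0 = time.perf_counter()
raw = xs.tobytes()          # one bulk serialization call (writeBin analogue)
ys = array.array("d")
ys.frombytes(raw)           # one bulk deserialization call (readBin analogue)
t1 = time.perf_counter()

print(len(raw) == 8 * n)    # -> True (8 bytes per IEEE 754 double)
print(ys == xs)             # -> True (lossless round trip)
print(f"round trip: {t1 - t0:.4f}s")  # typically well under a second
```

The contrast with the 3-minute collect() above is the point: the raw byte copying is cheap, so the overhead must come from the per-element SerDe protocol, which the proposal targets.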
[jira] [Updated] (SPARK-18849) Vignettes final checks for Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-18849:
----------------------------------
    Description:

Make a final pass over the vignettes and ensure the content is consistent.
* remove "since version" because it is not that useful for vignettes
* re-order/group the list of ML algorithms so there exists a logical ordering
* check for warnings or errors in output messages
* anything else that seems out of place

  was: (the same list, with the final item still empty)

> Vignettes final checks for Spark 2.1
> ------------------------------------
>
>                 Key: SPARK-18849
>                 URL: https://issues.apache.org/jira/browse/SPARK-18849
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation, SparkR
>            Reporter: Xiangrui Meng
>            Assignee: Felix Cheung
>
> (Issue description as above.)
[jira] [Updated] (SPARK-18849) Vignettes final checks for Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18849: -- Description: Make a final pass over the vignettes and ensure the content is consistent.
* remove "since version" because it is not that useful for vignettes
* re-order/group the list of ML algorithms so there is a logical ordering
* check for warning or error in output messages
*
was: Make a final pass over the vignettes and ensure the content is consistent.
* remove "since version" because it is not that useful for vignettes
* re-order/group the list of ML algorithms so there is a logical ordering
* ?
> Vignettes final checks for Spark 2.1
>
> Key: SPARK-18849
> URL: https://issues.apache.org/jira/browse/SPARK-18849
> Project: Spark
> Issue Type: Documentation
> Components: Documentation, SparkR
> Reporter: Xiangrui Meng
> Assignee: Felix Cheung
>
> Make a final pass over the vignettes and ensure the content is consistent.
> * remove "since version" because it is not that useful for vignettes
> * re-order/group the list of ML algorithms so there is a logical ordering
> * check for warning or error in output messages
> *
[jira] [Updated] (SPARK-18849) Vignettes final checks for Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18849: -- Assignee: Felix Cheung
> Vignettes final checks for Spark 2.1
>
> Key: SPARK-18849
> URL: https://issues.apache.org/jira/browse/SPARK-18849
> Project: Spark
> Issue Type: Documentation
> Components: Documentation, SparkR
> Reporter: Xiangrui Meng
> Assignee: Felix Cheung
>
> Make a final pass over the vignettes and ensure the content is consistent.
> * remove "since version" because it is not that useful for vignettes
> * re-order/group the list of ML algorithms so there is a logical ordering
> * ?
[jira] [Resolved] (SPARK-18793) SparkR vignette update: random forest
[ https://issues.apache.org/jira/browse/SPARK-18793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-18793. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16264 [https://github.com/apache/spark/pull/16264] > SparkR vignette update: random forest > - > > Key: SPARK-18793 > URL: https://issues.apache.org/jira/browse/SPARK-18793 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > Fix For: 2.1.1, 2.2.0 > > > Update vignettes to cover randomForest -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18794) SparkR vignette update: gbt
[ https://issues.apache.org/jira/browse/SPARK-18794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-18794. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16264 [https://github.com/apache/spark/pull/16264] > SparkR vignette update: gbt > --- > > Key: SPARK-18794 > URL: https://issues.apache.org/jira/browse/SPARK-18794 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > Fix For: 2.1.1, 2.2.0 > > > Update vignettes to cover gradient boosted trees -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18849) Vignettes final checks for Spark 2.1
Xiangrui Meng created SPARK-18849: - Summary: Vignettes final checks for Spark 2.1 Key: SPARK-18849 URL: https://issues.apache.org/jira/browse/SPARK-18849 Project: Spark Issue Type: Documentation Components: Documentation, SparkR Reporter: Xiangrui Meng
Make a final pass over the vignettes and ensure the content is consistent.
* remove "since version" because it is not that useful for vignettes
* re-order/group the list of ML algorithms so there is a logical ordering
* ?
[jira] [Updated] (SPARK-18795) SparkR vignette update: ksTest
[ https://issues.apache.org/jira/browse/SPARK-18795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18795: -- Assignee: Miao Wang (was: Xiangrui Meng) > SparkR vignette update: ksTest > -- > > Key: SPARK-18795 > URL: https://issues.apache.org/jira/browse/SPARK-18795 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Miao Wang > > Update vignettes to cover ksTest -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18793) SparkR vignette update: random forest
[ https://issues.apache.org/jira/browse/SPARK-18793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-18793: - Assignee: Xiangrui Meng > SparkR vignette update: random forest > - > > Key: SPARK-18793 > URL: https://issues.apache.org/jira/browse/SPARK-18793 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > Update vignettes to cover randomForest -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18795) SparkR vignette update: ksTest
[ https://issues.apache.org/jira/browse/SPARK-18795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744359#comment-15744359 ] Xiangrui Meng commented on SPARK-18795: --- [~wangmiao1981] Any updates? > SparkR vignette update: ksTest > -- > > Key: SPARK-18795 > URL: https://issues.apache.org/jira/browse/SPARK-18795 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Miao Wang > > Update vignettes to cover ksTest -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18794) SparkR vignette update: gbt
[ https://issues.apache.org/jira/browse/SPARK-18794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-18794: - Assignee: Xiangrui Meng > SparkR vignette update: gbt > --- > > Key: SPARK-18794 > URL: https://issues.apache.org/jira/browse/SPARK-18794 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > Update vignettes to cover gradient boosted trees -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18795) SparkR vignette update: ksTest
[ https://issues.apache.org/jira/browse/SPARK-18795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-18795: - Assignee: Xiangrui Meng > SparkR vignette update: ksTest > -- > > Key: SPARK-18795 > URL: https://issues.apache.org/jira/browse/SPARK-18795 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > Update vignettes to cover ksTest -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18797) Update spark.logit in sparkr-vignettes
[ https://issues.apache.org/jira/browse/SPARK-18797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-18797. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16222 [https://github.com/apache/spark/pull/16222] > Update spark.logit in sparkr-vignettes > -- > > Key: SPARK-18797 > URL: https://issues.apache.org/jira/browse/SPARK-18797 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Miao Wang > Fix For: 2.1.1, 2.2.0 > > > spark.logit is added in 2.1. We need to update spark-vignettes to reflect the > changes. This is part of SparkR QA work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18812) Clarify "Spark ML"
[ https://issues.apache.org/jira/browse/SPARK-18812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-18812. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16241 [https://github.com/apache/spark/pull/16241] > Clarify "Spark ML" > -- > > Key: SPARK-18812 > URL: https://issues.apache.org/jira/browse/SPARK-18812 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 2.1.1, 2.2.0 > > > It is useful to add an FAQ entry to explain "Spark ML" and reduce confusion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18812) Clarify "Spark ML"
Xiangrui Meng created SPARK-18812: - Summary: Clarify "Spark ML" Key: SPARK-18812 URL: https://issues.apache.org/jira/browse/SPARK-18812 Project: Spark Issue Type: Documentation Components: ML, MLlib Affects Versions: 2.1.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng It is useful to add an FAQ entry to explain "Spark ML" and reduce confusion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects
[ https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17822: -- Fix Version/s: 2.0.3 > JVMObjectTracker.objMap may leak JVM objects > > > Key: SPARK-17822 > URL: https://issues.apache.org/jira/browse/SPARK-17822 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yin Huai >Assignee: Xiangrui Meng > Fix For: 2.0.3, 2.1.1, 2.2.0 > > Attachments: screenshot-1.png > > > JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we > observed that JVM objects that are not used anymore are still trapped in this > map, which prevents those object get GCed. > Seems it makes sense to use weak reference (like persistentRdds in > SparkContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects
[ https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-17822. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16154 [https://github.com/apache/spark/pull/16154] > JVMObjectTracker.objMap may leak JVM objects > > > Key: SPARK-17822 > URL: https://issues.apache.org/jira/browse/SPARK-17822 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yin Huai >Assignee: Xiangrui Meng > Fix For: 2.1.1, 2.2.0 > > Attachments: screenshot-1.png > > > JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we > observed that JVM objects that are not used anymore are still trapped in this > map, which prevents those object get GCed. > Seems it makes sense to use weak reference (like persistentRdds in > SparkContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17647) SQL LIKE does not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15734678#comment-15734678 ] Xiangrui Meng commented on SPARK-17647: --- [~r...@databricks.com] [~yhuai] I think this is a critical correctness bug, which should be fixed in 2.1. Thoughts? > SQL LIKE does not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > {code} > It returned false, which is wrong. > cc: [~yhuai] [~joshrosen] > A false-negative considered previously: > {code} > select '' rlike '.*.*'; > {code} > It returned true, which is correct if we assume that the pattern is treated > as a Java string but not raw string. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18792) SparkR vignette update: logit
[ https://issues.apache.org/jira/browse/SPARK-18792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-18792. --- Resolution: Duplicate > SparkR vignette update: logit > - > > Key: SPARK-18792 > URL: https://issues.apache.org/jira/browse/SPARK-18792 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > Update vignettes to cover logit -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18792) SparkR vignette update: logit
[ https://issues.apache.org/jira/browse/SPARK-18792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15734024#comment-15734024 ] Xiangrui Meng commented on SPARK-18792: --- [~wangmiao1981] Please check existing sub-tasks before creating new ones. I'm closing mine. > SparkR vignette update: logit > - > > Key: SPARK-18792 > URL: https://issues.apache.org/jira/browse/SPARK-18792 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > Update vignettes to cover logit -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18792) SparkR vignette update: logit
[ https://issues.apache.org/jira/browse/SPARK-18792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-18792: - Assignee: Xiangrui Meng > SparkR vignette update: logit > - > > Key: SPARK-18792 > URL: https://issues.apache.org/jira/browse/SPARK-18792 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > Update vignettes to cover logit -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17823) Make JVMObjectTracker.objMap thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-17823. --- Resolution: Duplicate This is contained by SPARK-17822. > Make JVMObjectTracker.objMap thread-safe > > > Key: SPARK-17823 > URL: https://issues.apache.org/jira/browse/SPARK-17823 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yin Huai > > Since JVMObjectTracker.objMap is a global map, it makes sense to make it > thread safe. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15727981#comment-15727981 ] Xiangrui Meng commented on SPARK-18762: --- Thanks! Please make sure spark history server still works when ssl is enabled. > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this introduces several broken links in the UI. For > example, in the master UI, the worker link is https:8081 instead of http:8081 > or https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18762: -- Description: When SSL is enabled, the Spark shell shows: {code} Spark context Web UI available at https://192.168.99.1:4040 {code} This is wrong because 4040 is http, not https. It redirects to the https port. More importantly, this introduces several broken links in the UI. For example, in the master UI, the worker link is https:8081 instead of http:8081 or https:8481. was: When SSL is enabled, the Spark shell shows: {code} Spark context Web UI available at https://192.168.99.1:4040 {code} This is wrong because 4040 is http, not https. It redirects to the https port. More importantly, this cause several broken links in the UI. For example, in the master UI, the worker link is https:8081 instead of http:8081 or https:8481. > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this introduces several broken links in the UI. For > example, in the master UI, the worker link is https:8081 instead of http:8081 > or https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15727929#comment-15727929 ] Xiangrui Meng edited comment on SPARK-18762 at 12/7/16 6:56 AM: cc [~hayashidac] [~sarutak] [~lian cheng] was (Author: mengxr): cc [~hayashidac] [~sarutak] > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this cause several broken links in the UI. For example, in > the master UI, the worker link is https:8081 instead of http:8081 or > https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15727929#comment-15727929 ] Xiangrui Meng commented on SPARK-18762: --- cc [~hayashidac] [~sarutak] > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this cause several broken links in the UI. For example, in > the master UI, the worker link is https:8081 instead of http:8081 or > https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18762: -- Description: When SSL is enabled, the Spark shell shows: {code} Spark context Web UI available at https://192.168.99.1:4040 {code} This is wrong because 4040 is http, not https. It redirects to the https port. More importantly, this cause several broken links in the UI. For example, in the master UI, the worker link is https:8081 instead of http:8081 or https:8481. was: When SSL is enabled, the Spark shell shows: {code} Spark context Web UI available at https://192.168.99.1:4040 {code} This is wrong because 4040 is http, not https. It redirects to the https port. > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this cause several broken links in the UI. For example, in > the master UI, the worker link is https:8081 instead of http:8081 or > https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18762: -- Priority: Blocker (was: Critical) > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18762: -- Priority: Critical (was: Major) > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Critical > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18762) Web UI should be http:4040 instead of https:4040
Xiangrui Meng created SPARK-18762: - Summary: Web UI should be http:4040 instead of https:4040 Key: SPARK-18762 URL: https://issues.apache.org/jira/browse/SPARK-18762 Project: Spark Issue Type: Bug Components: Spark Shell, Web UI Affects Versions: 2.1.0 Reporter: Xiangrui Meng When SSL is enabled, the Spark shell shows: {code} Spark context Web UI available at https://192.168.99.1:4040 {code} This is wrong because 4040 is http, not https. It redirects to the https port. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects
[ https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722886#comment-15722886 ] Xiangrui Meng commented on SPARK-17822: --- The issue arises with multiple RBackend connections. It is feasible to create multiple RBackend sessions, but they all share the same `JVMObjectTracker`, which cannot tell which JVM object belongs to which RBackend. If an RBackend dies without proper cleanup, we get a memory leak. I will send a PR to make JVMObjectTracker a member variable of RBackend. More TODOs remain before we can fully allow concurrent RBackend sessions, but this would solve the most critical issue.
> JVMObjectTracker.objMap may leak JVM objects
>
> Key: SPARK-17822
> URL: https://issues.apache.org/jira/browse/SPARK-17822
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Reporter: Yin Huai
> Assignee: Xiangrui Meng
> Attachments: screenshot-1.png
>
> JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we
> observed that JVM objects that are no longer used are still trapped in this
> map, which prevents those objects from getting GCed.
> Seems it makes sense to use weak references (like persistentRdds in
> SparkContext).
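The fix described in the comment above — moving the object registry from a process-wide global into each backend session — can be sketched as follows. This is a hypothetical Python model, not the SparkR source; the class names `ObjectTracker` and `RBackendSession` are illustrative:

```python
# Sketch of per-session object tracking: each backend session owns its own
# tracker, so closing the session releases everything it registered, even if
# the remote R side died without sending remove-object calls.
import itertools

class ObjectTracker:
    def __init__(self):
        self._objs = {}                 # id -> tracked object
        self._ids = itertools.count()

    def track(self, obj):
        obj_id = next(self._ids)
        self._objs[obj_id] = obj
        return obj_id

    def get(self, obj_id):
        return self._objs[obj_id]

    def size(self):
        return len(self._objs)

    def clear(self):
        self._objs.clear()

class RBackendSession:
    def __init__(self):
        # Member variable, not a global: objects cannot leak across sessions.
        self.tracker = ObjectTracker()

    def close(self):
        self.tracker.clear()  # all tracked objects become collectable

a, b = RBackendSession(), RBackendSession()
a.tracker.track("obj-from-a")
b.tracker.track("obj-from-b")
a.close()
assert a.tracker.size() == 0 and b.tracker.size() == 1
```

With a single global tracker, the `a.close()` call above would have no safe way to drop only session a's objects; per-session ownership makes the cleanup boundary explicit.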
[jira] [Assigned] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects
[ https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-17822: - Assignee: Xiangrui Meng > JVMObjectTracker.objMap may leak JVM objects > > > Key: SPARK-17822 > URL: https://issues.apache.org/jira/browse/SPARK-17822 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yin Huai >Assignee: Xiangrui Meng > Attachments: screenshot-1.png > > > JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we > observed that JVM objects that are not used anymore are still trapped in this > map, which prevents those object get GCed. > Seems it makes sense to use weak reference (like persistentRdds in > SparkContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects
[ https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15716690#comment-15716690 ] Xiangrui Meng commented on SPARK-17822: --- I will take a look. > JVMObjectTracker.objMap may leak JVM objects > > > Key: SPARK-17822 > URL: https://issues.apache.org/jira/browse/SPARK-17822 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yin Huai > Attachments: screenshot-1.png > > > JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we > observed that JVM objects that are not used anymore are still trapped in this > map, which prevents those object get GCed. > Seems it makes sense to use weak reference (like persistentRdds in > SparkContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18374) Incorrect words in StopWords/english.txt
[ https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15707642#comment-15707642 ] Xiangrui Meng commented on SPARK-18374: --- See the discussion here: https://github.com/nltk/nltk_data/issues/22. Including `won` is apparently a mistake.
> Incorrect words in StopWords/english.txt
>
> Key: SPARK-18374
> URL: https://issues.apache.org/jira/browse/SPARK-18374
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.0.1
> Reporter: nirav patel
>
> I was just double-checking english.txt's list of stopwords, as I felt it was
> taking out valid tokens like 'won'. I think the issue is that the english.txt
> list is missing the apostrophe character and all characters after the
> apostrophe. So "won't" became "won" in that list, and "wouldn't" became
> "wouldn".
> Here are some incorrect tokens in this list:
> won
> wouldn
> ma
> mightn
> mustn
> needn
> shan
> shouldn
> wasn
> weren
> I think the ideal list should have both styles, i.e. both won't and wont
> should be part of english.txt, since some tokenizers might remove special
> characters. But 'won' obviously shouldn't be in this list.
> Here's the list of Snowball English stop words:
> http://snowball.tartarus.org/algorithms/english/stop.txt
[jira] [Updated] (SPARK-18317) ML, Graph 2.1 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-18317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18317: -- Attachment: spark-graphx_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html spark-mllib-local_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html spark-mllib_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html I checked the result from japi-compliance-checker. All binary incompatible changes reported are either private or package private. So we are good to go. > ML, Graph 2.1 QA: API: Binary incompatible changes > -- > > Key: SPARK-18317 > URL: https://issues.apache.org/jira/browse/SPARK-18317 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Blocker > Attachments: spark-graphx_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html, > spark-mllib-local_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html, > spark-mllib_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html > > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18317) ML, Graph 2.1 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-18317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-18317. --- Resolution: Done > ML, Graph 2.1 QA: API: Binary incompatible changes > -- > > Key: SPARK-18317 > URL: https://issues.apache.org/jira/browse/SPARK-18317 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Blocker > Attachments: spark-graphx_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html, > spark-mllib-local_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html, > spark-mllib_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html > > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18317) ML, Graph 2.1 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-18317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-18317: - Assignee: Xiangrui Meng > ML, Graph 2.1 QA: API: Binary incompatible changes > -- > > Key: SPARK-18317 > URL: https://issues.apache.org/jira/browse/SPARK-18317 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Blocker > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled
[ https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15652366#comment-15652366 ] Xiangrui Meng commented on SPARK-18390: --- This is a bug because the user didn't ask for a Cartesian join. In any case, this has since been fixed. > Optimized plan tried to use Cartesian join when it is not enabled > - > > Key: SPARK-18390 > URL: https://issues.apache.org/jira/browse/SPARK-18390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Xiangrui Meng >Assignee: Srinath > > {code} > val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) > val df3 = spark.range(1e9.toInt) > df3.join(df2, df3("id") === df2("one")).count() > {code} > throws > bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be > prohibitively expensive and are disabled by default. To explicitly enable > them, please set spark.sql.crossJoin.enabled = true; > This is probably not the right behavior because it was not the user who > suggested using cartesian product. SQL picked it while knowing it is not > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled
[ https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-18390. --- Resolution: Duplicate > Optimized plan tried to use Cartesian join when it is not enabled > - > > Key: SPARK-18390 > URL: https://issues.apache.org/jira/browse/SPARK-18390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Xiangrui Meng >Assignee: Srinath > > {code} > val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) > val df3 = spark.range(1e9.toInt) > df3.join(df2, df3("id") === df2("one")).count() > {code} > throws > bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be > prohibitively expensive and are disabled by default. To explicitly enable > them, please set spark.sql.crossJoin.enabled = true; > This is probably not the right behavior because it was not the user who > suggested using cartesian product. SQL picked it while knowing it is not > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled
[ https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18390: -- Description: I hit this error when I tried to test skewed joins. {code} val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) val df3 = spark.range(1e9.toInt) df3.join(df2, df3("id") === df2("one")).count() {code} throws {code} org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true; {code} This is probably not the right behavior because it was not the user who suggested using cartesian product. SQL picked it while knowing it is not enabled. was: I hit this error when I tried to test skewed joins. {code} val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) val df3 = spark.range(1e9.toInt) df3.join(df2, df3("id") === df2("one")).count() {code} throws {noformat} org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true; {noformat} This is probably not the right behavior because it was not the user who suggested using cartesian product. SQL picked it while knowing it is not enabled. > Optimized plan tried to use Cartesian join when it is not enabled > - > > Key: SPARK-18390 > URL: https://issues.apache.org/jira/browse/SPARK-18390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Xiangrui Meng > > I hit this error when I tried to test skewed joins. > {code} > val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) > val df3 = spark.range(1e9.toInt) > df3.join(df2, df3("id") === df2("one")).count() > {code} > throws > {code} > org.apache.spark.sql.AnalysisException: Cartesian joins could be > prohibitively expensive and are disabled by default. 
To explicitly enable > them, please set spark.sql.crossJoin.enabled = true; > {code} > This is probably not the right behavior because it was not the user who > suggested using cartesian product. SQL picked it while knowing it is not > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled
[ https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18390: -- Description: I hit this error when I tried to test skewed joins. {code} val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) val df3 = spark.range(1e9.toInt) df3.join(df2, df3("id") === df2("one")).count() {code} throws {noformat} org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true; {noformat} This is probably not the right behavior because it was not the user who suggested using cartesian product. SQL picked it while knowing it is not enabled. was: {code} val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) val df3 = spark.range(1e9.toInt) df3.join(df2, df3("id") === df2("one")).count() {code} throws {noformat} org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true; {noformat} This is probably not the right behavior because it was not the user who suggested using cartesian product. SQL picked it while knowing it is not enabled. > Optimized plan tried to use Cartesian join when it is not enabled > - > > Key: SPARK-18390 > URL: https://issues.apache.org/jira/browse/SPARK-18390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Xiangrui Meng > > I hit this error when I tried to test skewed joins. > {code} > val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) > val df3 = spark.range(1e9.toInt) > df3.join(df2, df3("id") === df2("one")).count() > {code} > throws > {noformat} > org.apache.spark.sql.AnalysisException: Cartesian joins could be > prohibitively expensive and are disabled by default. 
To explicitly enable > them, please set spark.sql.crossJoin.enabled = true; > {noformat} > This is probably not the right behavior because it was not the user who > suggested using cartesian product. SQL picked it while knowing it is not > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled
[ https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651859#comment-15651859 ] Xiangrui Meng commented on SPARK-18390: --- cc: [~yhuai] [~lian cheng] > Optimized plan tried to use Cartesian join when it is not enabled > - > > Key: SPARK-18390 > URL: https://issues.apache.org/jira/browse/SPARK-18390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Xiangrui Meng > > {code} > val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) > val df3 = spark.range(1e9.toInt) > df3.join(df2, df3("id") === df2("one")).count() > {code} > throws > {noformat} > org.apache.spark.sql.AnalysisException: Cartesian joins could be > prohibitively expensive and are disabled by default. To explicitly enable > them, please set spark.sql.crossJoin.enabled = true; > {noformat} > This is probably not the right behavior because it was not the user who > suggested using cartesian product. SQL picked it while knowing it is not > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled
Xiangrui Meng created SPARK-18390: - Summary: Optimized plan tried to use Cartesian join when it is not enabled Key: SPARK-18390 URL: https://issues.apache.org/jira/browse/SPARK-18390 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.1 Reporter: Xiangrui Meng {code} val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) val df3 = spark.range(1e9.toInt) df3.join(df2, df3("id") === df2("one")).count() {code} throws {noformat} org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true; {noformat} This is probably not the right behavior because it was not the user who suggested using cartesian product. SQL picked it while knowing it is not enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-14241. --- Resolution: Fixed > Output of monotonically_increasing_id lacks stable relation with rows of > DataFrame > -- > > Key: SPARK-14241 > URL: https://issues.apache.org/jira/browse/SPARK-14241 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Paul Shearer > Fix For: 2.0.0 > > > If you use monotonically_increasing_id() to append a column of IDs to a > DataFrame, the IDs do not have a stable, deterministic relationship to the > rows they are appended to. A given ID value can land on different rows > depending on what happens in the task graph: > http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321 > From a user perspective this behavior is very unexpected, and many things one > would normally like to do with an ID column are in fact only possible under > very narrow circumstances. The function should either be made deterministic, > or there should be a prominent warning note in the API docs regarding its > behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630024#comment-15630024 ] Xiangrui Meng edited comment on SPARK-14241 at 11/2/16 7:05 PM: This bug should be fixed in 2.0 already in SPARK-13473 since we don't swap filter and nondeterministic expressions in plan optimization. was (Author: mengxr): This bug should be fixed in 2.0 already since we don't swap filter and nondeterministic expressions in plan optimization. > Output of monotonically_increasing_id lacks stable relation with rows of > DataFrame > -- > > Key: SPARK-14241 > URL: https://issues.apache.org/jira/browse/SPARK-14241 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Paul Shearer > Fix For: 2.0.0 > > > If you use monotonically_increasing_id() to append a column of IDs to a > DataFrame, the IDs do not have a stable, deterministic relationship to the > rows they are appended to. A given ID value can land on different rows > depending on what happens in the task graph: > http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321 > From a user perspective this behavior is very unexpected, and many things one > would normally like to do with an ID column are in fact only possible under > very narrow circumstances. The function should either be made deterministic, > or there should be a prominent warning note in the API docs regarding its > behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14241: -- Fix Version/s: 2.0.0 > Output of monotonically_increasing_id lacks stable relation with rows of > DataFrame > -- > > Key: SPARK-14241 > URL: https://issues.apache.org/jira/browse/SPARK-14241 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Paul Shearer > Fix For: 2.0.0 > > > If you use monotonically_increasing_id() to append a column of IDs to a > DataFrame, the IDs do not have a stable, deterministic relationship to the > rows they are appended to. A given ID value can land on different rows > depending on what happens in the task graph: > http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321 > From a user perspective this behavior is very unexpected, and many things one > would normally like to do with an ID column are in fact only possible under > very narrow circumstances. The function should either be made deterministic, > or there should be a prominent warning note in the API docs regarding its > behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630024#comment-15630024 ] Xiangrui Meng commented on SPARK-14241: --- This bug should be fixed in 2.0 already since we don't swap filter and nondeterministic expressions in plan optimization. > Output of monotonically_increasing_id lacks stable relation with rows of > DataFrame > -- > > Key: SPARK-14241 > URL: https://issues.apache.org/jira/browse/SPARK-14241 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Paul Shearer > > If you use monotonically_increasing_id() to append a column of IDs to a > DataFrame, the IDs do not have a stable, deterministic relationship to the > rows they are appended to. A given ID value can land on different rows > depending on what happens in the task graph: > http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321 > From a user perspective this behavior is very unexpected, and many things one > would normally like to do with an ID column are in fact only possible under > very narrow circumstances. The function should either be made deterministic, > or there should be a prominent warning note in the API docs regarding its > behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce
[ https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-14393: - Assignee: Xiangrui Meng > monotonicallyIncreasingId not monotonically increasing with downstream > coalesce > --- > > Key: SPARK-14393 > URL: https://issues.apache.org/jira/browse/SPARK-14393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0, 2.0.1 >Reporter: Jason Piper >Assignee: Xiangrui Meng > Labels: correctness > > When utilising monotonicallyIncreasingId with a coalesce, it appears that > every partition uses the same offset (0) leading to non-monotonically > increasing IDs. > See examples below > {code} > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show() > +---+ > |monotonicallyincreasingid()| > +---+ > |25769803776| > |51539607552| > |77309411328| > | 103079215104| > | 128849018880| > | 163208757248| > | 188978561024| > | 214748364800| > | 240518168576| > | 266287972352| > +---+ > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > +---+ > >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 1| > | 0| > | 0| > | 1| > | 2| > | 3| > | 0| > | 1| > | 2| > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce
[ https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14393: -- Labels: correctness (was: ) > monotonicallyIncreasingId not monotonically increasing with downstream > coalesce > --- > > Key: SPARK-14393 > URL: https://issues.apache.org/jira/browse/SPARK-14393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jason Piper > Labels: correctness > > When utilising monotonicallyIncreasingId with a coalesce, it appears that > every partition uses the same offset (0) leading to non-monotonically > increasing IDs. > See examples below > {code} > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show() > +---+ > |monotonicallyincreasingid()| > +---+ > |25769803776| > |51539607552| > |77309411328| > | 103079215104| > | 128849018880| > | 163208757248| > | 188978561024| > | 214748364800| > | 240518168576| > | 266287972352| > +---+ > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > +---+ > >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 1| > | 0| > | 0| > | 1| > | 2| > | 3| > | 0| > | 1| > | 2| > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce
[ https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15584761#comment-15584761 ] Xiangrui Meng edited comment on SPARK-14393 at 10/18/16 7:43 AM: - This is a bigger issue. It would happen with (`monotonically_increasing_id`, `rand`, `randn`, etc) x (`coalesce`, `union`, etc). The root cause is that the partition ID used to initialize the operator is not the partition ID associated with the DataFrame where the column was originally defined, which is expected by users. cc [~r...@databricks.com] [~yhuai] was (Author: mengxr): This is a bigger issue. It would happen with {`monotonically_increasing_id`, `rand`, `randn`, etc} x {`coalesce`, `union`, etc}. The root cause is that the partition ID used to initialize the operator is not the partition ID associated with the DataFrame where the column was originally defined, which is expected by users. cc [~r...@databricks.com] [~yhuai] > monotonicallyIncreasingId not monotonically increasing with downstream > coalesce > --- > > Key: SPARK-14393 > URL: https://issues.apache.org/jira/browse/SPARK-14393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jason Piper > > When utilising monotonicallyIncreasingId with a coalesce, it appears that > every partition uses the same offset (0) leading to non-monotonically > increasing IDs. 
> See examples below > {code} > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show() > +---+ > |monotonicallyincreasingid()| > +---+ > |25769803776| > |51539607552| > |77309411328| > | 103079215104| > | 128849018880| > | 163208757248| > | 188978561024| > | 214748364800| > | 240518168576| > | 266287972352| > +---+ > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > +---+ > >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 1| > | 0| > | 0| > | 1| > | 2| > | 3| > | 0| > | 1| > | 2| > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce
[ https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15584761#comment-15584761 ] Xiangrui Meng commented on SPARK-14393: --- This is a bigger issue. It would happen with {`monotonically_increasing_id`, `rand`, `randn`, etc} x {`coalesce`, `union`, etc}. The root cause is that the partition ID used to initialize the operator is not the partition ID associated with the DataFrame where the column was originally defined, which is expected by users. cc [~r...@databricks.com] [~yhuai] > monotonicallyIncreasingId not monotonically increasing with downstream > coalesce > --- > > Key: SPARK-14393 > URL: https://issues.apache.org/jira/browse/SPARK-14393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jason Piper > > When utilising monotonicallyIncreasingId with a coalesce, it appears that > every partition uses the same offset (0) leading to non-monotonically > increasing IDs. > See examples below > {code} > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show() > +---+ > |monotonicallyincreasingid()| > +---+ > |25769803776| > |51539607552| > |77309411328| > | 103079215104| > | 128849018880| > | 163208757248| > | 188978561024| > | 214748364800| > | 240518168576| > | 266287972352| > +---+ > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > +---+ > >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 1| > | 0| > | 0| > | 1| > | 2| > | 3| > | 0| > | 1| > | 2| > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
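The large gaps in the first output block above follow from how Spark documents the ID layout: the partition ID goes in the upper 31 bits and the record number within the partition in the lower 33 bits. A short Python sketch reproduces the values shown (the specific partition numbers depend on the local cluster, so treat them as illustrative):

```python
def monotonically_increasing_id(partition_id, record_number):
    # Documented layout: upper 31 bits = partition ID,
    # lower 33 bits = record number within that partition.
    assert 0 <= partition_id < 2**31 and 0 <= record_number < 2**33
    return (partition_id << 33) | record_number

# The first two values in the example correspond to row 0 of
# partitions 3 and 6:
assert monotonically_increasing_id(3, 0) == 25769803776
assert monotonically_increasing_id(6, 0) == 51539607552

# The bug: after coalesce(1), each task re-initializes the expression
# with the post-coalesce partition ID (0), so every row gets IDs that
# restart from 0 instead of the IDs of the partition that produced it.
```

This matches the root-cause analysis in the comment above: the partition ID used to initialize the operator is not the partition ID of the DataFrame where the column was originally defined.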
[jira] [Created] (SPARK-17716) Hidden Markov Model (HMM)
Xiangrui Meng created SPARK-17716: - Summary: Hidden Markov Model (HMM) Key: SPARK-17716 URL: https://issues.apache.org/jira/browse/SPARK-17716 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Runxin Li Had an offline chat with [~Lil'Rex], who implemented HMM on Spark at https://github.com/apache/spark/compare/master...lilrex:sequence. I asked him to list popular HMM applications, describe the public API (params, input/output schemas), and compare its API with existing HMM implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
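The ticket does not specify an API, but as background for the API-design discussion it requests: the computational core shared by most HMM libraries is the forward algorithm, which computes the likelihood of an observation sequence. A minimal plain-Python sketch (all model parameters below are made up for illustration):

```python
def forward(pi, A, B, obs):
    """Forward algorithm: P(observation sequence | model).

    pi[i]   - initial probability of state i
    A[i][j] - transition probability from state i to state j
    B[i][o] - probability of emitting observation o in state i
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# Two-state toy model with binary observations (illustrative values):
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
p = forward(pi, A, B, [0, 1, 0])
assert 0.0 < p < 1.0
# Sanity check: probabilities over all length-1 sequences sum to 1.
assert abs(forward(pi, A, B, [0]) + forward(pi, A, B, [1]) - 1.0) < 1e-9
```

A Spark implementation would additionally need to decide on DataFrame input/output schemas for sequences, which is exactly the design question the ticket raises.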
[jira] [Comment Edited] (SPARK-17647) SQL LIKE does not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523623#comment-15523623 ] Xiangrui Meng edited comment on SPARK-17647 at 9/26/16 5:07 PM: Thanks [~joshrosen]! I updated the JIRA description. The LIKE escaping behaviors in MySQL/PostgreSQL are documented here: * MySQL: http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html#operator_like * PostgreSQL: https://www.postgresql.org/docs/8.3/static/functions-matching.html In particular, MySQL: {noformat} Exception: At the end of the pattern string, backslash can be specified as “\\”. At the end of the string, backslash stands for itself because there is nothing following to escape. {noformat} That explains why MySQL returns true for both {code} '\\' like '' '\\' like '\\' {code} was (Author: mengxr): Thanks [~joshrosen]! I updated the JIRA description. The LIKE escaping behaviors in MySQL/PostgreSQL are documented here: * MySQL: http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html#operator_like * PostgreSQL: https://www.postgresql.org/docs/8.3/static/functions-matching.html In particular, MySQL: {noformat} Exception: At the end of the pattern string, backslash can be specified as “\\”. At the end of the string, backslash stands for itself because there is nothing following to escape. {noformat} That explains why MySQL returns true for both `\\` like `` and `\\` like `\\`. > SQL LIKE does not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > {code} > It returned false, which is wrong. 
> cc: [~yhuai] [~joshrosen] > A false-negative considered previously: > {code} > select '' rlike '.*.*'; > {code} > It returned true, which is correct if we assume that the pattern is treated > as a Java string but not raw string. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17647) SQL LIKE does not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523623#comment-15523623 ] Xiangrui Meng commented on SPARK-17647: --- Thanks [~joshrosen]! I updated the JIRA description. The LIKE escaping behaviors in MySQL/PostgreSQL are documented here: * MySQL: http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html#operator_like * PostgreSQL: https://www.postgresql.org/docs/8.3/static/functions-matching.html In particular, MySQL: {noformat} Exception: At the end of the pattern string, backslash can be specified as “\\”. At the end of the string, backslash stands for itself because there is nothing following to escape. Suppose that a table contains the following values: {noformat} That explains why MySQL returns true for both `\\` like `` and `\\` like `\\`. > SQL LIKE does not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > {code} > It returned false, which is wrong. > cc: [~yhuai] [~joshrosen] > A false-negative considered previously: > {code} > select '' rlike '.*.*'; > {code} > It returned true, which is correct if we assume that the pattern is treated > as a Java string but not raw string. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17647) SQL LIKE does not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523623#comment-15523623 ] Xiangrui Meng edited comment on SPARK-17647 at 9/26/16 5:06 PM: Thanks [~joshrosen]! I updated the JIRA description. The LIKE escaping behaviors in MySQL/PostgreSQL are documented here: * MySQL: http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html#operator_like * PostgreSQL: https://www.postgresql.org/docs/8.3/static/functions-matching.html In particular, MySQL: {noformat} Exception: At the end of the pattern string, backslash can be specified as “\\”. At the end of the string, backslash stands for itself because there is nothing following to escape. {noformat} That explains why MySQL returns true for both `\\` like `` and `\\` like `\\`. was (Author: mengxr): Thanks [~joshrosen]! I updated the JIRA description. The LIKE escaping behaviors in MySQL/PostgreSQL are documented here: * MySQL: http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html#operator_like * PostgreSQL: https://www.postgresql.org/docs/8.3/static/functions-matching.html In particular, MySQL: {noformat} Exception: At the end of the pattern string, backslash can be specified as “\\”. At the end of the string, backslash stands for itself because there is nothing following to escape. Suppose that a table contains the following values: {noformat} That explains why MySQL returns true for both `\\` like `` and `\\` like `\\`. > SQL LIKE does not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > {code} > It returned false, which is wrong. 
> cc: [~yhuai] [~joshrosen] > A false-negative considered previously: > {code} > select '' rlike '.*.*'; > {code} > It returned true, which is correct if we assume that the pattern is treated > as a Java string but not raw string.
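The escaping rules discussed in the thread above (ordinary characters match literally, `%` matches any sequence, `_` matches any single character, a backslash makes the following character literal, and a trailing backslash stands for itself per MySQL's documented exception) can be sketched in pure Python. This is an illustrative LIKE-to-regex translation only, not Spark's actual implementation; `like_to_regex` and `sql_like` are hypothetical helper names.

```python
import re

def like_to_regex(pattern, escape="\\"):
    # Translate a SQL LIKE pattern into a regex string:
    #   '%' -> '.*', '_' -> '.', escape char makes the next char literal.
    # A trailing escape char stands for itself (MySQL's documented exception).
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == escape and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))  # escaped char matches literally
            i += 2
        elif c == escape:
            out.append(re.escape(escape))  # trailing escape: literal backslash
            i += 1
        elif c == "%":
            out.append(".*")
            i += 1
        elif c == "_":
            out.append(".")
            i += 1
        else:
            out.append(re.escape(c))
            i += 1
    return "".join(out)

def sql_like(value, pattern):
    # fullmatch anchors the pattern at both ends, as SQL LIKE requires.
    return re.fullmatch(like_to_regex(pattern), value) is not None
```

A production implementation would additionally need to honor a user-specified ESCAPE clause and each engine's rules for invalid escape sequences, which is exactly where the MySQL/PostgreSQL behaviors linked above diverge.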
[jira] [Updated] (SPARK-17647) SQL LIKE do not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17647: -- Summary: SQL LIKE do not handle backslashes correctly (was: SQL LIKE/RLIKE do not handle backslashes correctly) > SQL LIKE do not handle backslashes correctly > > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > {code} > It returned false, which is wrong. > cc: [~yhuai] [~joshrosen] > A false-negative considered previously: > {code} > select '' rlike '.*.*'; > {code} > It returned true, which is correct if we assume that the pattern is treated > as a Java string but not raw string.
[jira] [Updated] (SPARK-17647) SQL LIKE/RLIKE do not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17647: -- Description: Try the following in SQL shell: {code} select '' like '%\\%'; {code} It returned false, which is wrong. cc: [~yhuai] [~joshrosen] A false-negative considered previously: {code} select '' rlike '.*.*'; {code} It returned true, which is correct if we assume that the pattern is treated as a Java string but not raw string. was: Try the following in SQL shell: {code} select '' like '%\\%'; {code} It returned false, which is wrong. cc: [~yhuai] [~joshrosen] A false-negative considered previously): {code} select '' rlike '.*.*'; {code} It returned true, which is correct if we assume that the pattern is treated as a Java string but not raw string. > SQL LIKE/RLIKE do not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > {code} > It returned false, which is wrong. > cc: [~yhuai] [~joshrosen] > A false-negative considered previously: > {code} > select '' rlike '.*.*'; > {code} > It returned true, which is correct if we assume that the pattern is treated > as a Java string but not raw string.
[jira] [Updated] (SPARK-17647) SQL LIKE does not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17647: -- Summary: SQL LIKE does not handle backslashes correctly (was: SQL LIKE do not handle backslashes correctly) > SQL LIKE does not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > {code} > It returned false, which is wrong. > cc: [~yhuai] [~joshrosen] > A false-negative considered previously: > {code} > select '' rlike '.*.*'; > {code} > It returned true, which is correct if we assume that the pattern is treated > as a Java string but not raw string.
[jira] [Updated] (SPARK-17647) SQL LIKE/RLIKE do not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17647: -- Description: Try the following in SQL shell: {code} select '' like '%\\%'; {code} It returned false, which is wrong. cc: [~yhuai] [~joshrosen] A false-negative considered previously): {code} select '' rlike '.*.*'; {code} It returned true, which is correct if we assume that the pattern is treated as a Java string but not raw string. was: Try the following in SQL shell: {code} select '' like '%\\%'; select '' rlike '.*.*'; {code} The first returned false and the second returned true. Both are wrong. cc: [~yhuai] [~joshrosen] > SQL LIKE/RLIKE do not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > {code} > It returned false, which is wrong. > cc: [~yhuai] [~joshrosen] > A false-negative considered previously): > {code} > select '' rlike '.*.*'; > {code} > It returned true, which is correct if we assume that the pattern is treated > as a Java string but not raw string.
[jira] [Updated] (SPARK-17647) SQL LIKE/RLIKE do not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17647: -- Labels: correctness (was: ) > SQL LIKE/RLIKE do not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > select '' rlike '.*.*'; > {code} > The first returned false and the second returned true. Both are wrong. > cc: [~yhuai]
[jira] [Updated] (SPARK-17647) SQL LIKE/RLIKE do not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17647: -- Description: Try the following in SQL shell: {code} select '' like '%\\%'; select '' rlike '.*.*'; {code} The first returned false and the second returned true. Both are wrong. cc: [~yhuai] [~joshrosen] was: Try the following in SQL shell: {code} select '' like '%\\%'; select '' rlike '.*.*'; {code} The first returned false and the second returned true. Both are wrong. cc: [~yhuai] > SQL LIKE/RLIKE do not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > select '' rlike '.*.*'; > {code} > The first returned false and the second returned true. Both are wrong. > cc: [~yhuai] [~joshrosen]
[jira] [Created] (SPARK-17647) SQL LIKE/RLIKE do not handle backslashes correctly
Xiangrui Meng created SPARK-17647: - Summary: SQL LIKE/RLIKE do not handle backslashes correctly Key: SPARK-17647 URL: https://issues.apache.org/jira/browse/SPARK-17647 Project: Spark Issue Type: Bug Components: SQL Reporter: Xiangrui Meng Try the following in SQL shell: {code} select '' like '%\\%'; select '' rlike '.*.*'; {code} The first returned false and the second returned true. Both are wrong.
[jira] [Updated] (SPARK-17647) SQL LIKE/RLIKE do not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17647: -- Description: Try the following in SQL shell: {code} select '' like '%\\%'; select '' rlike '.*.*'; {code} The first returned false and the second returned true. Both are wrong. cc: [~yhuai] was: Try the following in SQL shell: {code} select '' like '%\\%'; select '' rlike '.*.*'; {code} The first returned false and the second returned true. Both are wrong. > SQL LIKE/RLIKE do not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > select '' rlike '.*.*'; > {code} > The first returned false and the second returned true. Both are wrong. > cc: [~yhuai]
[jira] [Updated] (SPARK-17641) collect_set should ignore null values
[ https://issues.apache.org/jira/browse/SPARK-17641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17641: -- Description: `collect_set` throws the following exception when there are null values. It should ignore null values to be consistent with other aggregation methods. {code} select collect_set(null) from (select 1) tmp; java.lang.IllegalArgumentException: Flat hash tables cannot contain null elements. at scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390) at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41) at scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136) at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41) at org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29) at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232) at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225) {code} cc: [~yhuai] was: `collect_set` throws the following exception when there are null values. It should ignore null values to be consistent with other aggregation methods. {code} java.lang.IllegalArgumentException: Flat hash tables cannot contain null elements. at scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390) at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41) at scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136) at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41) at org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29) at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232) at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225) {code} cc: [~yhuai] > collect_set should ignore null values > 
- > > Key: SPARK-17641 > URL: https://issues.apache.org/jira/browse/SPARK-17641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > `collect_set` throws the following exception when there are null values. It > should ignore null values to be consistent with other aggregation methods. > {code} > select collect_set(null) from (select 1) tmp; > java.lang.IllegalArgumentException: Flat hash tables cannot contain null > elements. > at > scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHas
[jira] [Updated] (SPARK-17641) collect_set should ignore null values
[ https://issues.apache.org/jira/browse/SPARK-17641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17641: -- Description: `collect_set` throws the following exception when there are null values. It should ignore null values to be consistent with other aggregation methods. {code} java.lang.IllegalArgumentException: Flat hash tables cannot contain null elements. at scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390) at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41) at scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136) at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41) at org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29) at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232) at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225) {code} cc: [~yhuai] was: `collect_set` throws the following exception when there are null values. It should ignore null values to be consistent with other aggregation methods. {code} java.lang.IllegalArgumentException: Flat hash tables cannot contain null elements. at scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390) at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41) at scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136) at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41) at org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29) at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232) at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225) {code} > collect_set should ignore null values > - > > Key: 
SPARK-17641 > URL: https://issues.apache.org/jira/browse/SPARK-17641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > `collect_set` throws the following exception when there are null values. It > should ignore null values to be consistent with other aggregation methods. > {code} > java.lang.IllegalArgumentException: Flat hash tables cannot contain null > elements. > at > scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390) > at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41) > at > sc
[jira] [Created] (SPARK-17641) collect_set should ignore null values
Xiangrui Meng created SPARK-17641: - Summary: collect_set should ignore null values Key: SPARK-17641 URL: https://issues.apache.org/jira/browse/SPARK-17641 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xiangrui Meng `collect_set` throws the following exception when there are null values. It should ignore null values to be consistent with other aggregation methods. {code} java.lang.IllegalArgumentException: Flat hash tables cannot contain null elements. at scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390) at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41) at scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136) at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41) at org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29) at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232) at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225) {code}
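The fix proposed in SPARK-17641 — skipping nulls rather than inserting them into the underlying hash set — can be illustrated with a small pure-Python aggregation. This is a behavioral sketch only, not Spark's Catalyst implementation; `collect_set` here is a stand-in for the SQL aggregate, with `None` playing the role of SQL NULL.

```python
def collect_set(values):
    # Collect the distinct non-null values of an input sequence.
    # Nulls (None) are ignored instead of being added, so the aggregate
    # never hits the "Flat hash tables cannot contain null elements"
    # failure shown in the stack trace above, and stays consistent with
    # other aggregates (e.g. count, sum), which skip nulls.
    result = set()
    for v in values:
        if v is not None:
            result.add(v)
    return result
```

Under this behavior, `select collect_set(null) from (select 1) tmp` would return an empty set instead of raising an exception.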
[jira] [Comment Edited] (SPARK-16578) Configurable hostname for RBackend
[ https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429942#comment-15429942 ] Xiangrui Meng edited comment on SPARK-16578 at 8/22/16 12:47 AM: - [~shivaram] I had an offline discussion with [~junyangq] and I feel that we might have some misunderstanding of user scenarios. The old workflow for SparkR is the following: 1. Users download and install Spark distribution by themselves. 2. Users let R know where to find the SparkR package on local. 3. `library(SparkR)` 4. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. And the ideal workflow is the following: 1. install.packages("SparkR") from CRAN and then `library(SparkR)` 2. optionally `install.spark` 3. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. So the way we run spark-submit, RBackend, and R process, and create the SparkContext doesn't really change. They are still running on the same machine (e.g., user's laptop). So it is not necessary to make RBackend running remotely for this scenario. Having RBackend running remotely is a new Spark deployment mode and I think it requires more design and discussions. was (Author: mengxr): [~shivaram] I had an offline discussion with [~junyangq] and I feel that we might have some misunderstanding of user scenarios. The old workflow for SparkR is the following: 1. Users download and install Spark distribution by themselves. 2. Users let R know where to find the SparkR package on local. 3. `library(SparkR)` 4. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. And the ideal workflow is the following: 1. install.packages("SparkR") from CRAN 2. optionally `install.spark` 3. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. So the way we run spark-submit, RBackend, and R process, and create the SparkContext doesn't really change. 
They are still running on the same machine (e.g., user's laptop). So it is not necessary to make RBackend running remotely for this scenario. Having RBackend running remotely is a new Spark deployment mode and I think it requires more design and discussions. > Configurable hostname for RBackend > -- > > Key: SPARK-16578 > URL: https://issues.apache.org/jira/browse/SPARK-16578 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Junyang Qian > > One of the requirements that comes up with SparkR being a standalone package > is that users can now install just the R package on the client side and > connect to a remote machine which runs the RBackend class. > We should check if we can support this mode of execution and what are the > pros / cons of it
[jira] [Comment Edited] (SPARK-16578) Configurable hostname for RBackend
[ https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429942#comment-15429942 ] Xiangrui Meng edited comment on SPARK-16578 at 8/22/16 12:46 AM: - [~shivaram] I had an offline discussion with [~junyangq] and I feel that we might have some misunderstanding of user scenarios. The old workflow for SparkR is the following: 1. Users download and install Spark distribution by themselves. 2. Users let R know where to find the SparkR package on local. 3. `library(SparkR)` 4. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. And the ideal workflow is the following: 1. install.packages("SparkR") from CRAN 2. optionally `install.spark` 3. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. So the way we run spark-submit, RBackend, and R process, and create the SparkContext doesn't really change. They are still running on the same machine (e.g., user's laptop). So it is not necessary to make RBackend running remotely for this scenario. Having RBackend running remotely is a new Spark deployment mode and I think it requires more design and discussions. was (Author: mengxr): [~shivaram] I had an offline discussion with [~junyangq] and I feel that we might have some misunderstanding of user scenarios. The old workflow for SparkR is the following: 1. Users download and install Spark distribution by themselves. 2. Users let R know where to find the SparkR package on local. 3. `library(SparkR)` 4. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. And the ideal workflow is the following: 1. install.packages("SparkR") 2. optionally `install.spark` 3. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. So the way we run spark-submit, RBackend, and R process, and create the SparkContext doesn't really change. 
They are still running on the same machine (e.g., user's laptop). So it is not necessary to make RBackend running remotely for this scenario. Having RBackend running remotely is a new Spark deployment mode and I think it requires more design and discussions. > Configurable hostname for RBackend > -- > > Key: SPARK-16578 > URL: https://issues.apache.org/jira/browse/SPARK-16578 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Junyang Qian > > One of the requirements that comes up with SparkR being a standalone package > is that users can now install just the R package on the client side and > connect to a remote machine which runs the RBackend class. > We should check if we can support this mode of execution and what are the > pros / cons of it
[jira] [Commented] (SPARK-16578) Configurable hostname for RBackend
[ https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429942#comment-15429942 ] Xiangrui Meng commented on SPARK-16578: --- [~shivaram] I had an offline discussion with [~junyangq] and I feel that we might have some misunderstanding of user scenarios. The old workflow for SparkR is the following: 1. Users download and install Spark distribution by themselves. 2. Users let R know where to find the SparkR package on local. 3. `library(SparkR)` 4. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. And the ideal workflow is the following: 1. install.packages("SparkR") 2. optionally `install.spark` 3. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. So the way we run spark-submit, RBackend, and R process, and create the SparkContext doesn't really change. They are still running on the same machine (e.g., user's laptop). So it is not necessary to make RBackend running remotely for this scenario. Having RBackend running remotely is a new Spark deployment mode and I think it requires more design and discussions. > Configurable hostname for RBackend > -- > > Key: SPARK-16578 > URL: https://issues.apache.org/jira/browse/SPARK-16578 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Junyang Qian > > One of the requirements that comes up with SparkR being a standalone package > is that users can now install just the R package on the client side and > connect to a remote machine which runs the RBackend class. > We should check if we can support this mode of execution and what are the > pros / cons of it
[jira] [Updated] (SPARK-16578) Configurable hostname for RBackend
[ https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16578: -- Assignee: Junyang Qian > Configurable hostname for RBackend > -- > > Key: SPARK-16578 > URL: https://issues.apache.org/jira/browse/SPARK-16578 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Junyang Qian > > One of the requirements that comes up with SparkR being a standalone package > is that users can now install just the R package on the client side and > connect to a remote machine which runs the RBackend class. > We should check if we can support this mode of execution and what are the > pros / cons of it
[jira] [Resolved] (SPARK-16443) ALS wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16443. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14384 [https://github.com/apache/spark/pull/14384] > ALS wrapper in SparkR > - > > Key: SPARK-16443 > URL: https://issues.apache.org/jira/browse/SPARK-16443 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Junyang Qian > Fix For: 2.1.0 > > > Wrap MLlib's ALS in SparkR. We should discuss whether we want to support R > formula or not for ALS.
[jira] [Resolved] (SPARK-16446) Gaussian Mixture Model wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16446. --- Resolution: Fixed Fix Version/s: 2.1.0 > Gaussian Mixture Model wrapper in SparkR > > > Key: SPARK-16446 > URL: https://issues.apache.org/jira/browse/SPARK-16446 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > Fix For: 2.1.0 > > > Follow instructions in SPARK-16442 and implement Gaussian Mixture Model > wrapper in SparkR.
[jira] [Commented] (SPARK-16445) Multilayer Perceptron Classifier wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394642#comment-15394642 ] Xiangrui Meng commented on SPARK-16445: --- [~iamshrek] Any updates? > Multilayer Perceptron Classifier wrapper in SparkR > -- > > Key: SPARK-16445 > URL: https://issues.apache.org/jira/browse/SPARK-16445 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Xin Ren > > Follow instructions in SPARK-16442 and implement multilayer perceptron > classifier wrapper in SparkR.
[jira] [Commented] (SPARK-16446) Gaussian Mixture Model wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394639#comment-15394639 ] Xiangrui Meng commented on SPARK-16446: --- [~yanboliang] Any updates? > Gaussian Mixture Model wrapper in SparkR > > > Key: SPARK-16446 > URL: https://issues.apache.org/jira/browse/SPARK-16446 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > Follow instructions in SPARK-16442 and implement Gaussian Mixture Model > wrapper in SparkR.
[jira] [Updated] (SPARK-16444) Isotonic Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16444: -- Shepherd: Junyang Qian > Isotonic Regression wrapper in SparkR > - > > Key: SPARK-16444 > URL: https://issues.apache.org/jira/browse/SPARK-16444 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Miao Wang > > Implement Isotonic Regression wrapper and other utils in SparkR. > {code} > spark.isotonicRegression(data, formula, ...) > {code}
[jira] [Updated] (SPARK-16579) Add a spark install function
[ https://issues.apache.org/jira/browse/SPARK-16579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16579: -- Assignee: Junyang Qian > Add a spark install function > > > Key: SPARK-16579 > URL: https://issues.apache.org/jira/browse/SPARK-16579 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Junyang Qian > > As described in the design doc, we need to introduce a function to install > Spark in case the user directly downloads SparkR from CRAN. > To do that we can introduce an install_spark function that takes the > following arguments > {code} > hadoop_version > url_to_use # defaults to apache > local_dir # defaults to a cache dir > {code} > Furthermore, I think we can run this automatically from sparkR.init if we > find the Spark home and JARs missing.
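[Editor's note] The install function proposed above might be sketched roughly as follows in R. The function and argument names (install_spark, hadoop_version, url_to_use, local_dir) come from the issue description; the body, the default values, and the archive URL layout are assumptions for illustration, not Spark's actual implementation.

```r
# Hypothetical sketch only; defaults and URL layout are assumed, not the final API.
install_spark <- function(hadoop_version = "2.7",
                          url_to_use = "https://archive.apache.org/dist/spark",
                          local_dir = file.path(path.expand("~"), ".cache", "spark")) {
  version <- "2.0.0"  # assumed Spark version for illustration
  pkg <- sprintf("spark-%s-bin-hadoop%s", version, hadoop_version)
  spark_home <- file.path(local_dir, pkg)
  if (!dir.exists(spark_home)) {
    dir.create(local_dir, recursive = TRUE, showWarnings = FALSE)
    tarball <- file.path(local_dir, paste0(pkg, ".tgz"))
    # Download the binary distribution and unpack it into the cache dir
    download.file(sprintf("%s/spark-%s/%s.tgz", url_to_use, version, pkg), tarball)
    untar(tarball, exdir = local_dir)
  }
  spark_home  # return SPARK_HOME so sparkR.init could pick it up
}
```

Returning the unpacked directory would let sparkR.init call this automatically when it cannot find a Spark home, as the description suggests.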
[jira] [Updated] (SPARK-16538) Cannot use "SparkR::sql"
[ https://issues.apache.org/jira/browse/SPARK-16538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16538: -- Fix Version/s: 1.6.3 > Cannot use "SparkR::sql" > > > Key: SPARK-16538 > URL: https://issues.apache.org/jira/browse/SPARK-16538 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.2, 2.0.0 >Reporter: Weiluo Ren >Assignee: Felix Cheung >Priority: Critical > Fix For: 1.6.3, 2.0.0 > > > When calling "SparkR::sql", an error pops up. For instance: > {code} > SparkR::sql("") > Error in get(paste0(funcName, ".default")) : > object '::.default' not found > {code} > https://github.com/apache/spark/blob/f4767bcc7a9d1bdd301f054776aa45e7c9f344a7/R/pkg/R/SQLContext.R#L51
[jira] [Updated] (SPARK-16447) LDA wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16447: -- Assignee: Xusen Yin > LDA wrapper in SparkR > - > > Key: SPARK-16447 > URL: https://issues.apache.org/jira/browse/SPARK-16447 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > Follow instructions in SPARK-16442 and implement LDA wrapper in SparkR.
[jira] [Commented] (SPARK-16447) LDA wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371384#comment-15371384 ] Xiangrui Meng commented on SPARK-16447: --- Assigned. Thanks! > LDA wrapper in SparkR > - > > Key: SPARK-16447 > URL: https://issues.apache.org/jira/browse/SPARK-16447 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > Follow instructions in SPARK-16442 and implement LDA wrapper in SparkR.
[jira] [Commented] (SPARK-16445) Multilayer Perceptron Classifier wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371382#comment-15371382 ] Xiangrui Meng commented on SPARK-16445: --- The target version is 2.1.0. So no strict deadline but thanks for asking! > Multilayer Perceptron Classifier wrapper in SparkR > -- > > Key: SPARK-16445 > URL: https://issues.apache.org/jira/browse/SPARK-16445 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Xin Ren > > Follow instructions in SPARK-16442 and implement multilayer perceptron > classifier wrapper in SparkR.
[jira] [Updated] (SPARK-16445) Multilayer Perceptron Classifier wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16445: -- Assignee: Xin Ren > Multilayer Perceptron Classifier wrapper in SparkR > -- > > Key: SPARK-16445 > URL: https://issues.apache.org/jira/browse/SPARK-16445 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Xin Ren > > Follow instructions in SPARK-16442 and implement multilayer perceptron > classifier wrapper in SparkR.
[jira] [Updated] (SPARK-16444) Isotonic Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16444: -- Assignee: Miao Wang > Isotonic Regression wrapper in SparkR > - > > Key: SPARK-16444 > URL: https://issues.apache.org/jira/browse/SPARK-16444 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Miao Wang > > Implement Isotonic Regression wrapper and other utils in SparkR. > {code} > spark.isotonicRegression(data, formula, ...) > {code}
[jira] [Commented] (SPARK-15767) Decision Tree Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368547#comment-15368547 ] Xiangrui Meng commented on SPARK-15767: --- This was discussed in SPARK-14831. We should call it `spark.algo(data, formula, method, required params, [optional params])` and use the same param names as in MLlib. But I'm not sure what method name to use here. We should think about method names for all tree methods together. cc [~josephkb] > Decision Tree Regression wrapper in SparkR > -- > > Key: SPARK-15767 > URL: https://issues.apache.org/jira/browse/SPARK-15767 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Kai Jiang >Assignee: Kai Jiang > > Implement a wrapper in SparkR to support decision tree regression. R's native > Decision Tree Regression implementation comes from the rpart package, with the > signature rpart(formula, dataframe, method="anova"). I propose we implement an > API like spark.rpart(dataframe, formula, ...). After implementing decision > tree classification, we could refactor these two into an API more like rpart()
[jira] [Commented] (SPARK-16446) Gaussian Mixture Model wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15367937#comment-15367937 ] Xiangrui Meng commented on SPARK-16446: --- [~yanboliang] Do you have time to work on this? > Gaussian Mixture Model wrapper in SparkR > > > Key: SPARK-16446 > URL: https://issues.apache.org/jira/browse/SPARK-16446 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng > > Follow instructions in SPARK-16442 and implement Gaussian Mixture Model > wrapper in SparkR.
[jira] [Created] (SPARK-16446) Gaussian Mixture Model wrapper in SparkR
Xiangrui Meng created SPARK-16446: - Summary: Gaussian Mixture Model wrapper in SparkR Key: SPARK-16446 URL: https://issues.apache.org/jira/browse/SPARK-16446 Project: Spark Issue Type: Sub-task Reporter: Xiangrui Meng Follow instructions in SPARK-16442 and implement Gaussian Mixture Model wrapper in SparkR.
[jira] [Created] (SPARK-16447) LDA wrapper in SparkR
Xiangrui Meng created SPARK-16447: - Summary: LDA wrapper in SparkR Key: SPARK-16447 URL: https://issues.apache.org/jira/browse/SPARK-16447 Project: Spark Issue Type: Sub-task Reporter: Xiangrui Meng Follow instructions in SPARK-16442 and implement LDA wrapper in SparkR.
[jira] [Updated] (SPARK-15767) Decision Tree Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15767: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-16442 > Decision Tree Regression wrapper in SparkR > -- > > Key: SPARK-15767 > URL: https://issues.apache.org/jira/browse/SPARK-15767 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Kai Jiang >Assignee: Kai Jiang > > Implement a wrapper in SparkR to support decision tree regression. R's native > Decision Tree Regression implementation comes from the rpart package, with the > signature rpart(formula, dataframe, method="anova"). I propose we implement an > API like spark.rpart(dataframe, formula, ...). After implementing decision > tree classification, we could refactor these two into an API more like rpart()