[jira] [Updated] (SPARK-20771) Usability issues with weekofyear()
[ https://issues.apache.org/jira/browse/SPARK-20771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-20771:
----------------------------------
    Description:

The weekofyear() implementation follows the Hive / ISO 8601 week number. However, it is not very useful because it does not return the year the week belongs to. For example, weekofyear("2017-01-01") returns 52. Anyone using this with groupBy('week) might get the aggregation or ordering wrong. A better implementation should return the year number of the week as well. MySQL's yearweek() is much better in this sense: https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_yearweek. Maybe we should implement that in Spark.

  was: (the same text, reading "might do the aggregation wrong" rather than "might do the aggregation or ordering wrong")

> Usability issues with weekofyear()
> ----------------------------------
>
>                 Key: SPARK-20771
>                 URL: https://issues.apache.org/jira/browse/SPARK-20771
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Xiangrui Meng
>            Priority: Minor
>
> (Issue description as above.)

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
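As a stand-alone illustration of the pitfall (plain Python, not Spark code): `datetime.isocalendar()` follows the same ISO 8601 convention, and shows why the week number alone is ambiguous at year boundaries. Pairing it with the ISO year, in the spirit of MySQL's yearweek(), yields an unambiguous grouping key. The helper name below is hypothetical.

```python
from datetime import date

def iso_yearweek(d: date) -> int:
    # Combine ISO year and ISO week into one sortable key,
    # mirroring MySQL's yearweek() (year * 100 + week).
    iso_year, iso_week, _ = d.isocalendar()
    return iso_year * 100 + iso_week

# 2017-01-01 falls in ISO week 52 *of 2016*; the bare week number
# would group it with dates from late December 2016.
print(date(2017, 1, 1).isocalendar()[:2])  # -> (2016, 52)
print(iso_yearweek(date(2017, 1, 1)))      # -> 201652
print(iso_yearweek(date(2016, 12, 31)))    # -> 201652 (same group, correctly)
print(iso_yearweek(date(2017, 1, 2)))      # -> 201701
```

Grouping by the combined key keeps week 52 of 2016 and week 52 of 2017 apart, which groupBy on the week number alone cannot do.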
[jira] [Created] (SPARK-20771) Usability issues with weekofyear()
Xiangrui Meng created SPARK-20771:
-------------------------------------

             Summary: Usability issues with weekofyear()
                 Key: SPARK-20771
                 URL: https://issues.apache.org/jira/browse/SPARK-20771
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Xiangrui Meng
            Priority: Minor

(Original description as filed; the current description above adds "or ordering".)
[jira] [Created] (SPARK-20129) JavaSparkContext should use SparkContext.getOrCreate
Xiangrui Meng created SPARK-20129:
-------------------------------------

             Summary: JavaSparkContext should use SparkContext.getOrCreate
                 Key: SPARK-20129
                 URL: https://issues.apache.org/jira/browse/SPARK-20129
             Project: Spark
          Issue Type: Improvement
          Components: Java API
    Affects Versions: 2.1.0
            Reporter: Xiangrui Meng

It should re-use an existing SparkContext if there is a live one.
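For readers unfamiliar with the pattern, here is a hedged, stand-alone sketch (plain Python, not Spark's actual implementation) of what a getOrCreate-style constructor does: reuse the live singleton if one exists, otherwise create it. The `Context` class is a toy stand-in.

```python
import threading

class Context:
    """Toy stand-in for SparkContext; illustrative only."""
    _active = None            # the currently live context, if any
    _lock = threading.Lock()  # guard creation against races

    def __init__(self, app_name: str):
        self.app_name = app_name

    @classmethod
    def get_or_create(cls, app_name: str = "default") -> "Context":
        # Reuse the existing live context instead of failing or
        # constructing a second one -- the behavior this ticket asks
        # JavaSparkContext to delegate to SparkContext.getOrCreate.
        with cls._lock:
            if cls._active is None:
                cls._active = cls(app_name)
            return cls._active

a = Context.get_or_create("first")
b = Context.get_or_create("second")  # returns the same live instance
print(a is b)                        # -> True
print(b.app_name)                    # -> first
```

Note that arguments passed to a later call are ignored once a context is live, which matches the "re-use an existing SparkContext" behavior requested above.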
[jira] [Assigned] (SPARK-20129) JavaSparkContext should use SparkContext.getOrCreate
[ https://issues.apache.org/jira/browse/SPARK-20129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng reassigned SPARK-20129:
-------------------------------------

    Assignee: Xiangrui Meng

> JavaSparkContext should use SparkContext.getOrCreate
> ----------------------------------------------------
>
>                 Key: SPARK-20129
>                 URL: https://issues.apache.org/jira/browse/SPARK-20129
>             Project: Spark
>          Issue Type: Improvement
>          Components: Java API
>    Affects Versions: 2.1.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>
> It should re-use an existing SparkContext if there is a live one.
[jira] [Assigned] (SPARK-20088) Do not create new SparkContext in SparkR createSparkContext
[ https://issues.apache.org/jira/browse/SPARK-20088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng reassigned SPARK-20088:
-------------------------------------

    Assignee: Hossein Falaki

> Do not create new SparkContext in SparkR createSparkContext
> -----------------------------------------------------------
>
>                 Key: SPARK-20088
>                 URL: https://issues.apache.org/jira/browse/SPARK-20088
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.2.0
>            Reporter: Hossein Falaki
>            Assignee: Hossein Falaki
>             Fix For: 2.2.0
>
> In the implementation of {{createSparkContext}}, we are calling
> {code}
> new JavaSparkContext()
> {code}
[jira] [Resolved] (SPARK-20088) Do not create new SparkContext in SparkR createSparkContext
[ https://issues.apache.org/jira/browse/SPARK-20088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-20088.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0

Issue resolved by pull request 17423
[https://github.com/apache/spark/pull/17423]

> Do not create new SparkContext in SparkR createSparkContext
> -----------------------------------------------------------
>
> (Key, metadata, and description as in the message above.)
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880786#comment-15880786 ]

Xiangrui Meng commented on SPARK-5226:
--------------------------------------

I closed this ticket as "Won't Do" due to DBSCAN's high complexity, and hence poor scalability, as documented in http://staff.itee.uq.edu.au/taoyf/paper/sigmod15-dbscan.pdf.

> Add DBSCAN Clustering Algorithm to MLlib
> ----------------------------------------
>
>                 Key: SPARK-5226
>                 URL: https://issues.apache.org/jira/browse/SPARK-5226
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Muhammad-Ali A'rabi
>            Priority: Minor
>              Labels: DBSCAN, clustering
>
> MLlib is all k-means now, and I think we should add some new clustering algorithms to it. The first candidate, I think, is DBSCAN.
[jira] [Closed] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng closed SPARK-5226.
--------------------------------
    Resolution: Won't Fix

> Add DBSCAN Clustering Algorithm to MLlib
> ----------------------------------------
>
> (Key, metadata, and description as in the message above.)
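As context for the scalability concern raised in the comment above, here is a minimal textbook DBSCAN sketch (plain Python, not a proposed MLlib implementation): each point triggers a neighborhood query, and without a spatial index every query scans all n points, giving the O(n^2) behavior the linked paper analyzes.

```python
from collections import deque

def dbscan(points, eps, min_pts):
    """Naive DBSCAN; every region query scans all n points (O(n^2) total)."""
    n = len(points)
    UNVISITED, NOISE = -2, -1
    labels = [UNVISITED] * n

    def region_query(i):
        # Linear scan over all points -- the scalability bottleneck.
        px, py = points[i]
        return [j for j, (qx, qy) in enumerate(points)
                if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

    cluster = 0
    for i in range(n):
        if labels[i] != UNVISITED:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster          # border point joins the cluster
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=3))  # -> [0, 0, 0, 1, 1, 1, -1]
```

Spatial indexes reduce the per-query cost, but the paper cited above shows even indexed DBSCAN degrades in higher dimensions, which motivates the "Won't Do" resolution.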
[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858963#comment-15858963 ]

Xiangrui Meng commented on SPARK-18924:
---------------------------------------

I'm going to work on this one, so I removed myself as "Shepherd".

> Improve collect/createDataFrame performance in SparkR
> -----------------------------------------------------
>
>                 Key: SPARK-18924
>                 URL: https://issues.apache.org/jira/browse/SPARK-18924
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> SparkR has its own SerDe for data serialization between the JVM and R.
> The SerDe on the JVM side is implemented in:
> * [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
> * [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]
> The SerDe on the R side is implemented in:
> * [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
> * [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]
>
> The serialization between the JVM and R suffers from huge storage and computation overhead. For example, a short round trip of 1 million doubles surprisingly took 3 minutes on my laptop:
> {code}
> > system.time(collect(createDataFrame(data.frame(x=runif(1000000)))))
>    user  system elapsed
>  14.224   0.582 189.135
> {code}
> Collecting a medium-sized DataFrame to local and continuing with a local R workflow is a use case we should pay attention to. SparkR will never be able to cover all existing features from CRAN packages. It is also unnecessary for Spark to do so because not all features need scalability.
>
> Several factors contribute to the serialization overhead:
> 1. The SerDe on the R side is implemented using high-level R methods.
> 2. DataFrame columns are not efficiently serialized, primitive type columns in particular.
> 3. Some overhead in the serialization protocol/impl.
>
> 1) might have been discussed before because R packages like rJava existed before SparkR. I'm not sure whether we have a license issue in depending on those libraries. Another option is to switch to R's low-level C interface or Rcpp, which again might have license issues. I'm not an expert here. If we have to implement our own, there still exists much space for improvement, discussed below.
>
> 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, which collects rows to local and then constructs columns. However,
> * it ignores column types and results in boxing/unboxing overhead
> * it collects all objects to the driver and results in high GC pressure
> A relatively simple change is to implement a specialized column builder based on column types, primitive types in particular. We need to handle null/NA values properly. A simple data structure we can use is
> {code}
> val size: Int
> val nullIndexes: Array[Int]
> val notNullValues: Array[T] // specialized for primitive types
> {code}
> On the R side, we can use `readBin` and `writeBin` to read the entire vector in a single method call. The speed seems reasonable (on the order of GB/s):
> {code}
> > x <- runif(10000000) # 1e7, not 1e6
> > system.time(r <- writeBin(x, raw(0)))
>    user  system elapsed
>   0.036   0.021   0.059
> >
> > system.time(y <- readBin(r, double(), 10000000))
>    user  system elapsed
>   0.015   0.007   0.024
> {code}
> This is just a proposal that needs to be discussed and formalized. But in general, it should be feasible to obtain a 20x or more performance gain.
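The proposed {size, nullIndexes, notNullValues} column layout can be sketched in plain Python (illustrative only; the proposal itself targets Scala and R, and the class name below is hypothetical): non-null values are packed densely per type, with null positions recorded separately, so the whole column serializes in one bulk call.

```python
import struct

class DoubleColumnBuilder:
    """Sketch of the proposed {size, nullIndexes, notNullValues} layout
    for a double column; hypothetical, not Spark's actual builder."""

    def __init__(self):
        self.size = 0
        self.null_indexes = []      # positions of NA/null cells
        self.not_null_values = []   # densely packed non-null doubles

    def append(self, value):
        if value is None:
            self.null_indexes.append(self.size)
        else:
            self.not_null_values.append(float(value))
        self.size += 1

    def to_bytes(self):
        # One bulk pack per column, analogous to writeBin on the R side.
        return struct.pack(f"<{len(self.not_null_values)}d", *self.not_null_values)

    def to_list(self):
        # Reconstruct the column with nulls in place (readBin analogue).
        out, it = [None] * self.size, iter(self.not_null_values)
        nulls = set(self.null_indexes)
        for i in range(self.size):
            if i not in nulls:
                out[i] = next(it)
        return out

b = DoubleColumnBuilder()
for v in [1.0, None, 3.5, None, 2.0]:
    b.append(v)
print(b.size, b.null_indexes, b.not_null_values)  # -> 5 [1, 3] [1.0, 3.5, 2.0]
print(len(b.to_bytes()))                          # -> 24 (3 doubles * 8 bytes)
print(b.to_list())                                # -> [1.0, None, 3.5, None, 2.0]
```

Because each value stays unboxed in a primitive buffer, this avoids both the boxing overhead and the per-element protocol round trips described above.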
[jira] [Updated] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-18924:
----------------------------------
    Shepherd:   (was: Xiangrui Meng)

> Improve collect/createDataFrame performance in SparkR
> -----------------------------------------------------
>
>                 Key: SPARK-18924
>                 URL: https://issues.apache.org/jira/browse/SPARK-18924
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> (Full issue description quoted above.)
[jira] [Assigned] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng reassigned SPARK-18924:
-------------------------------------

    Assignee: Xiangrui Meng

> Improve collect/createDataFrame performance in SparkR
> -----------------------------------------------------
>
>                 Key: SPARK-18924
>                 URL: https://issues.apache.org/jira/browse/SPARK-18924
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> (Full issue description quoted above.)
[jira] [Updated] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-18924:
----------------------------------
    Description:

(The updated description, quoted in full in the messages above. Relative to the previous revision, the link labels become the file names SerDe.scala, SQLUtils.scala, deserialize.R, and serialize.R, and "null values" becomes "null/NA values", among minor wording changes.)

  was:

(The previous revision of the same description; this copy is truncated mid-sentence in the archive.)
[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760439#comment-15760439 ]

Xiangrui Meng commented on SPARK-18924:
---------------------------------------

cc: [~shivaram] [~felixcheung] [~falaki] [~yanboliang] for discussion.

> Improve collect/createDataFrame performance in SparkR
> -----------------------------------------------------
>
>                 Key: SPARK-18924
>                 URL: https://issues.apache.org/jira/browse/SPARK-18924
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Xiangrui Meng
>            Priority: Critical
>
> (Issue description as originally filed; the full, later-updated text is quoted above.)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
Xiangrui Meng created SPARK-18924:
-------------------------------------

             Summary: Improve collect/createDataFrame performance in SparkR
                 Key: SPARK-18924
                 URL: https://issues.apache.org/jira/browse/SPARK-18924
             Project: Spark
          Issue Type: Improvement
          Components: SparkR
            Reporter: Xiangrui Meng
            Priority: Critical

(Original description as filed; see the full text quoted in the messages above.)
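The readBin/writeBin measurement in the proposal has a direct Python-stdlib analogue (illustrative only, not a SparkR benchmark): serializing a large double vector in one bulk call rather than element by element is what makes GB/s-range throughput plausible.

```python
import array
import time

n = 1_000_000
xs = array.array("d", (i * 0.001 for i in range(n)))  # 1e6 doubles

t0 = time.perf_counter()
raw = xs.tobytes()          # one bulk serialization call (writeBin analogue)
ys = array.array("d")
ys.frombytes(raw)           # one bulk deserialization call (readBin analogue)
t1 = time.perf_counter()

print(len(raw) == 8 * n)    # -> True (8 bytes per IEEE 754 double)
print(ys == xs)             # -> True (lossless round trip)
print(f"round trip: {t1 - t0:.4f}s")  # typically well under a second
```

The contrast with the 3-minute collect() above is the point: the raw byte copying is cheap, so the overhead must come from the per-element SerDe protocol, which the proposal targets.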
[jira] [Updated] (SPARK-18849) Vignettes final checks for Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-18849:
----------------------------------
    Description:

Make a final pass over the vignettes and ensure the content is consistent.
* remove "since version" because it is not that useful for vignettes
* re-order/group the list of ML algorithms so there exists a logical ordering
* check for warnings or errors in output messages
* anything else that seems out of place

  was: (the same list, with the final item still empty)

> Vignettes final checks for Spark 2.1
> ------------------------------------
>
>                 Key: SPARK-18849
>                 URL: https://issues.apache.org/jira/browse/SPARK-18849
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation, SparkR
>            Reporter: Xiangrui Meng
>            Assignee: Felix Cheung
>
> (Issue description as above.)
[jira] [Updated] (SPARK-18849) Vignettes final checks for Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18849: -- Description: Make a final pass over the vignettes and ensure the content is consistent.
* remove "since version" because it is not that useful for vignettes
* re-order/group the list of ML algorithms so there is a logical ordering
* check for warning or error in output messages
*
was: Make a final pass over the vignettes and ensure the content is consistent.
* remove "since version" because it is not that useful for vignettes
* re-order/group the list of ML algorithms so there is a logical ordering
* ?
> Vignettes final checks for Spark 2.1
>
> Key: SPARK-18849
> URL: https://issues.apache.org/jira/browse/SPARK-18849
> Project: Spark
> Issue Type: Documentation
> Components: Documentation, SparkR
> Reporter: Xiangrui Meng
> Assignee: Felix Cheung
>
> Make a final pass over the vignettes and ensure the content is consistent.
> * remove "since version" because it is not that useful for vignettes
> * re-order/group the list of ML algorithms so there is a logical ordering
> * check for warning or error in output messages
> *
[jira] [Updated] (SPARK-18849) Vignettes final checks for Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18849: -- Assignee: Felix Cheung
> Vignettes final checks for Spark 2.1
>
> Key: SPARK-18849
> URL: https://issues.apache.org/jira/browse/SPARK-18849
> Project: Spark
> Issue Type: Documentation
> Components: Documentation, SparkR
> Reporter: Xiangrui Meng
> Assignee: Felix Cheung
>
> Make a final pass over the vignettes and ensure the content is consistent.
> * remove "since version" because it is not that useful for vignettes
> * re-order/group the list of ML algorithms so there is a logical ordering
> * ?
[jira] [Resolved] (SPARK-18793) SparkR vignette update: random forest
[ https://issues.apache.org/jira/browse/SPARK-18793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-18793. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16264 [https://github.com/apache/spark/pull/16264] > SparkR vignette update: random forest > - > > Key: SPARK-18793 > URL: https://issues.apache.org/jira/browse/SPARK-18793 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > Fix For: 2.1.1, 2.2.0 > > > Update vignettes to cover randomForest -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18794) SparkR vignette update: gbt
[ https://issues.apache.org/jira/browse/SPARK-18794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-18794. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16264 [https://github.com/apache/spark/pull/16264] > SparkR vignette update: gbt > --- > > Key: SPARK-18794 > URL: https://issues.apache.org/jira/browse/SPARK-18794 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > Fix For: 2.1.1, 2.2.0 > > > Update vignettes to cover gradient boosted trees -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18849) Vignettes final checks for Spark 2.1
Xiangrui Meng created SPARK-18849: - Summary: Vignettes final checks for Spark 2.1 Key: SPARK-18849 URL: https://issues.apache.org/jira/browse/SPARK-18849 Project: Spark Issue Type: Documentation Components: Documentation, SparkR Reporter: Xiangrui Meng
Make a final pass over the vignettes and ensure the content is consistent.
* remove "since version" because it is not that useful for vignettes
* re-order/group the list of ML algorithms so there is a logical ordering
* ?
[jira] [Updated] (SPARK-18795) SparkR vignette update: ksTest
[ https://issues.apache.org/jira/browse/SPARK-18795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18795: -- Assignee: Miao Wang (was: Xiangrui Meng) > SparkR vignette update: ksTest > -- > > Key: SPARK-18795 > URL: https://issues.apache.org/jira/browse/SPARK-18795 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Miao Wang > > Update vignettes to cover ksTest -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18793) SparkR vignette update: random forest
[ https://issues.apache.org/jira/browse/SPARK-18793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-18793: - Assignee: Xiangrui Meng > SparkR vignette update: random forest > - > > Key: SPARK-18793 > URL: https://issues.apache.org/jira/browse/SPARK-18793 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > Update vignettes to cover randomForest -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18795) SparkR vignette update: ksTest
[ https://issues.apache.org/jira/browse/SPARK-18795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744359#comment-15744359 ] Xiangrui Meng commented on SPARK-18795: --- [~wangmiao1981] Any updates? > SparkR vignette update: ksTest > -- > > Key: SPARK-18795 > URL: https://issues.apache.org/jira/browse/SPARK-18795 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Miao Wang > > Update vignettes to cover ksTest -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18794) SparkR vignette update: gbt
[ https://issues.apache.org/jira/browse/SPARK-18794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-18794: - Assignee: Xiangrui Meng > SparkR vignette update: gbt > --- > > Key: SPARK-18794 > URL: https://issues.apache.org/jira/browse/SPARK-18794 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > Update vignettes to cover gradient boosted trees -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18795) SparkR vignette update: ksTest
[ https://issues.apache.org/jira/browse/SPARK-18795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-18795: - Assignee: Xiangrui Meng > SparkR vignette update: ksTest > -- > > Key: SPARK-18795 > URL: https://issues.apache.org/jira/browse/SPARK-18795 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > Update vignettes to cover ksTest -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18797) Update spark.logit in sparkr-vignettes
[ https://issues.apache.org/jira/browse/SPARK-18797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-18797. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16222 [https://github.com/apache/spark/pull/16222] > Update spark.logit in sparkr-vignettes > -- > > Key: SPARK-18797 > URL: https://issues.apache.org/jira/browse/SPARK-18797 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Miao Wang > Fix For: 2.1.1, 2.2.0 > > > spark.logit is added in 2.1. We need to update spark-vignettes to reflect the > changes. This is part of SparkR QA work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18812) Clarify "Spark ML"
[ https://issues.apache.org/jira/browse/SPARK-18812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-18812. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16241 [https://github.com/apache/spark/pull/16241] > Clarify "Spark ML" > -- > > Key: SPARK-18812 > URL: https://issues.apache.org/jira/browse/SPARK-18812 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 2.1.1, 2.2.0 > > > It is useful to add an FAQ entry to explain "Spark ML" and reduce confusion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18812) Clarify "Spark ML"
Xiangrui Meng created SPARK-18812: - Summary: Clarify "Spark ML" Key: SPARK-18812 URL: https://issues.apache.org/jira/browse/SPARK-18812 Project: Spark Issue Type: Documentation Components: ML, MLlib Affects Versions: 2.1.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng It is useful to add an FAQ entry to explain "Spark ML" and reduce confusion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects
[ https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17822: -- Fix Version/s: 2.0.3 > JVMObjectTracker.objMap may leak JVM objects > > > Key: SPARK-17822 > URL: https://issues.apache.org/jira/browse/SPARK-17822 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yin Huai >Assignee: Xiangrui Meng > Fix For: 2.0.3, 2.1.1, 2.2.0 > > Attachments: screenshot-1.png > > > JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we > observed that JVM objects that are not used anymore are still trapped in this > map, which prevents those object get GCed. > Seems it makes sense to use weak reference (like persistentRdds in > SparkContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects
[ https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-17822. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16154 [https://github.com/apache/spark/pull/16154] > JVMObjectTracker.objMap may leak JVM objects > > > Key: SPARK-17822 > URL: https://issues.apache.org/jira/browse/SPARK-17822 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yin Huai >Assignee: Xiangrui Meng > Fix For: 2.1.1, 2.2.0 > > Attachments: screenshot-1.png > > > JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we > observed that JVM objects that are not used anymore are still trapped in this > map, which prevents those object get GCed. > Seems it makes sense to use weak reference (like persistentRdds in > SparkContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17647) SQL LIKE does not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15734678#comment-15734678 ] Xiangrui Meng commented on SPARK-17647: --- [~r...@databricks.com] [~yhuai] I think this is a critical correctness bug, which should be fixed in 2.1. Thoughts? > SQL LIKE does not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > {code} > It returned false, which is wrong. > cc: [~yhuai] [~joshrosen] > A false-negative considered previously: > {code} > select '' rlike '.*.*'; > {code} > It returned true, which is correct if we assume that the pattern is treated > as a Java string but not raw string. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18792) SparkR vignette update: logit
[ https://issues.apache.org/jira/browse/SPARK-18792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-18792. --- Resolution: Duplicate > SparkR vignette update: logit > - > > Key: SPARK-18792 > URL: https://issues.apache.org/jira/browse/SPARK-18792 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > Update vignettes to cover logit -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18792) SparkR vignette update: logit
[ https://issues.apache.org/jira/browse/SPARK-18792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15734024#comment-15734024 ] Xiangrui Meng commented on SPARK-18792: --- [~wangmiao1981] Please check existing sub-tasks before creating new ones. I'm closing mine. > SparkR vignette update: logit > - > > Key: SPARK-18792 > URL: https://issues.apache.org/jira/browse/SPARK-18792 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > Update vignettes to cover logit -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18792) SparkR vignette update: logit
[ https://issues.apache.org/jira/browse/SPARK-18792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-18792: - Assignee: Xiangrui Meng > SparkR vignette update: logit > - > > Key: SPARK-18792 > URL: https://issues.apache.org/jira/browse/SPARK-18792 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > Update vignettes to cover logit -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17823) Make JVMObjectTracker.objMap thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-17823. --- Resolution: Duplicate This is contained by SPARK-17822. > Make JVMObjectTracker.objMap thread-safe > > > Key: SPARK-17823 > URL: https://issues.apache.org/jira/browse/SPARK-17823 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yin Huai > > Since JVMObjectTracker.objMap is a global map, it makes sense to make it > thread safe. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15727981#comment-15727981 ] Xiangrui Meng commented on SPARK-18762: --- Thanks! Please make sure spark history server still works when ssl is enabled. > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this introduces several broken links in the UI. For > example, in the master UI, the worker link is https:8081 instead of http:8081 > or https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18762: -- Description: When SSL is enabled, the Spark shell shows: {code} Spark context Web UI available at https://192.168.99.1:4040 {code} This is wrong because 4040 is http, not https. It redirects to the https port. More importantly, this introduces several broken links in the UI. For example, in the master UI, the worker link is https:8081 instead of http:8081 or https:8481. was: When SSL is enabled, the Spark shell shows: {code} Spark context Web UI available at https://192.168.99.1:4040 {code} This is wrong because 4040 is http, not https. It redirects to the https port. More importantly, this cause several broken links in the UI. For example, in the master UI, the worker link is https:8081 instead of http:8081 or https:8481. > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this introduces several broken links in the UI. For > example, in the master UI, the worker link is https:8081 instead of http:8081 > or https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15727929#comment-15727929 ] Xiangrui Meng edited comment on SPARK-18762 at 12/7/16 6:56 AM: cc [~hayashidac] [~sarutak] [~lian cheng] was (Author: mengxr): cc [~hayashidac] [~sarutak] > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this cause several broken links in the UI. For example, in > the master UI, the worker link is https:8081 instead of http:8081 or > https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15727929#comment-15727929 ] Xiangrui Meng commented on SPARK-18762: --- cc [~hayashidac] [~sarutak] > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this cause several broken links in the UI. For example, in > the master UI, the worker link is https:8081 instead of http:8081 or > https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18762: -- Description: When SSL is enabled, the Spark shell shows: {code} Spark context Web UI available at https://192.168.99.1:4040 {code} This is wrong because 4040 is http, not https. It redirects to the https port. More importantly, this cause several broken links in the UI. For example, in the master UI, the worker link is https:8081 instead of http:8081 or https:8481. was: When SSL is enabled, the Spark shell shows: {code} Spark context Web UI available at https://192.168.99.1:4040 {code} This is wrong because 4040 is http, not https. It redirects to the https port. > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this cause several broken links in the UI. For example, in > the master UI, the worker link is https:8081 instead of http:8081 or > https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18762: -- Priority: Blocker (was: Critical) > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18762: -- Priority: Critical (was: Major) > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Critical > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18762) Web UI should be http:4040 instead of https:4040
Xiangrui Meng created SPARK-18762: - Summary: Web UI should be http:4040 instead of https:4040 Key: SPARK-18762 URL: https://issues.apache.org/jira/browse/SPARK-18762 Project: Spark Issue Type: Bug Components: Spark Shell, Web UI Affects Versions: 2.1.0 Reporter: Xiangrui Meng When SSL is enabled, the Spark shell shows: {code} Spark context Web UI available at https://192.168.99.1:4040 {code} This is wrong because 4040 is http, not https. It redirects to the https port. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects
[ https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722886#comment-15722886 ] Xiangrui Meng commented on SPARK-17822: --- The issue arises with multiple RBackend connections. It is feasible to create multiple RBackend sessions, but they all share the same `JVMObjectTracker`, which cannot tell which JVM object belongs to which RBackend. If an RBackend dies without proper cleanup, we get a memory leak. I will send a PR to make JVMObjectTracker a member variable of RBackend. More TODOs remain before we can fully allow concurrent RBackend sessions, but this would solve the most critical issue.
> JVMObjectTracker.objMap may leak JVM objects
>
> Key: SPARK-17822
> URL: https://issues.apache.org/jira/browse/SPARK-17822
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Reporter: Yin Huai
> Assignee: Xiangrui Meng
> Attachments: screenshot-1.png
>
> JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we
> observed that JVM objects that are no longer used are still trapped in this
> map, which prevents those objects from getting GCed.
> Seems it makes sense to use weak references (like persistentRdds in
> SparkContext).
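The fix described in the comment above — moving the object registry from a process-wide global into each backend session — can be sketched as follows. This is a hypothetical Python model, not the SparkR source; the class names `ObjectTracker` and `RBackendSession` are illustrative:

```python
# Sketch of per-session object tracking: each backend session owns its own
# tracker, so closing the session releases everything it registered, even if
# the remote R side died without sending remove-object calls.
import itertools

class ObjectTracker:
    def __init__(self):
        self._objs = {}                 # id -> tracked object
        self._ids = itertools.count()

    def track(self, obj):
        obj_id = next(self._ids)
        self._objs[obj_id] = obj
        return obj_id

    def get(self, obj_id):
        return self._objs[obj_id]

    def size(self):
        return len(self._objs)

    def clear(self):
        self._objs.clear()

class RBackendSession:
    def __init__(self):
        # Member variable, not a global: objects cannot leak across sessions.
        self.tracker = ObjectTracker()

    def close(self):
        self.tracker.clear()  # all tracked objects become collectable

a, b = RBackendSession(), RBackendSession()
a.tracker.track("obj-from-a")
b.tracker.track("obj-from-b")
a.close()
assert a.tracker.size() == 0 and b.tracker.size() == 1
```

With a single global tracker, the `a.close()` call above would have no safe way to drop only session a's objects; per-session ownership makes the cleanup boundary explicit.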
[jira] [Assigned] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects
[ https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-17822: - Assignee: Xiangrui Meng > JVMObjectTracker.objMap may leak JVM objects > > > Key: SPARK-17822 > URL: https://issues.apache.org/jira/browse/SPARK-17822 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yin Huai >Assignee: Xiangrui Meng > Attachments: screenshot-1.png > > > JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we > observed that JVM objects that are not used anymore are still trapped in this > map, which prevents those object get GCed. > Seems it makes sense to use weak reference (like persistentRdds in > SparkContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects
[ https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15716690#comment-15716690 ] Xiangrui Meng commented on SPARK-17822: --- I will take a look. > JVMObjectTracker.objMap may leak JVM objects > > > Key: SPARK-17822 > URL: https://issues.apache.org/jira/browse/SPARK-17822 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yin Huai > Attachments: screenshot-1.png > > > JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we > observed that JVM objects that are not used anymore are still trapped in this > map, which prevents those object get GCed. > Seems it makes sense to use weak reference (like persistentRdds in > SparkContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18374) Incorrect words in StopWords/english.txt
[ https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15707642#comment-15707642 ] Xiangrui Meng commented on SPARK-18374: --- See the discussion here: https://github.com/nltk/nltk_data/issues/22. Including `won` is apparently a mistake.
> Incorrect words in StopWords/english.txt
>
> Key: SPARK-18374
> URL: https://issues.apache.org/jira/browse/SPARK-18374
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.0.1
> Reporter: nirav patel
>
> I was just double-checking english.txt's list of stopwords, as I felt it was
> taking out valid tokens like 'won'. I think the issue is that the english.txt
> list is missing the apostrophe character and all characters after the
> apostrophe. So "won't" became "won" in that list, and "wouldn't" became
> "wouldn".
> Here are some incorrect tokens in this list:
> won
> wouldn
> ma
> mightn
> mustn
> needn
> shan
> shouldn
> wasn
> weren
> I think the ideal list should have both styles, i.e. both won't and wont
> should be part of english.txt, since some tokenizers might remove special
> characters. But 'won' obviously shouldn't be in this list.
> Here's the list of Snowball English stop words:
> http://snowball.tartarus.org/algorithms/english/stop.txt
[jira] [Updated] (SPARK-18317) ML, Graph 2.1 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-18317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18317: -- Attachment: spark-graphx_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html spark-mllib-local_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html spark-mllib_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html I checked the result from japi-compliance-checker. All binary incompatible changes reported are either private or package private. So we are good to go. > ML, Graph 2.1 QA: API: Binary incompatible changes > -- > > Key: SPARK-18317 > URL: https://issues.apache.org/jira/browse/SPARK-18317 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Blocker > Attachments: spark-graphx_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html, > spark-mllib-local_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html, > spark-mllib_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html > > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18317) ML, Graph 2.1 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-18317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-18317. --- Resolution: Done > ML, Graph 2.1 QA: API: Binary incompatible changes > -- > > Key: SPARK-18317 > URL: https://issues.apache.org/jira/browse/SPARK-18317 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Blocker > Attachments: spark-graphx_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html, > spark-mllib-local_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html, > spark-mllib_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html > > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18317) ML, Graph 2.1 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-18317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-18317: - Assignee: Xiangrui Meng > ML, Graph 2.1 QA: API: Binary incompatible changes > -- > > Key: SPARK-18317 > URL: https://issues.apache.org/jira/browse/SPARK-18317 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Blocker > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled
[ https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15652366#comment-15652366 ] Xiangrui Meng commented on SPARK-18390: --- This is a bug because the user didn't ask for a Cartesian join. In any case, this has since been fixed. > Optimized plan tried to use Cartesian join when it is not enabled > - > > Key: SPARK-18390 > URL: https://issues.apache.org/jira/browse/SPARK-18390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Xiangrui Meng >Assignee: Srinath > > {code} > val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) > val df3 = spark.range(1e9.toInt) > df3.join(df2, df3("id") === df2("one")).count() > {code} > throws > bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be > prohibitively expensive and are disabled by default. To explicitly enable > them, please set spark.sql.crossJoin.enabled = true; > This is probably not the right behavior because it was not the user who > suggested using cartesian product. SQL picked it while knowing it is not > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled
[ https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-18390. --- Resolution: Duplicate > Optimized plan tried to use Cartesian join when it is not enabled > - > > Key: SPARK-18390 > URL: https://issues.apache.org/jira/browse/SPARK-18390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Xiangrui Meng >Assignee: Srinath > > {code} > val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) > val df3 = spark.range(1e9.toInt) > df3.join(df2, df3("id") === df2("one")).count() > {code} > throws > bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be > prohibitively expensive and are disabled by default. To explicitly enable > them, please set spark.sql.crossJoin.enabled = true; > This is probably not the right behavior because it was not the user who > suggested using cartesian product. SQL picked it while knowing it is not > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled
[ https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18390: -- Description: I hit this error when I tried to test skewed joins. {code} val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) val df3 = spark.range(1e9.toInt) df3.join(df2, df3("id") === df2("one")).count() {code} throws {code} org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true; {code} This is probably not the right behavior because it was not the user who suggested using cartesian product. SQL picked it while knowing it is not enabled. was: I hit this error when I tried to test skewed joins. {code} val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) val df3 = spark.range(1e9.toInt) df3.join(df2, df3("id") === df2("one")).count() {code} throws {noformat} org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true; {noformat} This is probably not the right behavior because it was not the user who suggested using cartesian product. SQL picked it while knowing it is not enabled. > Optimized plan tried to use Cartesian join when it is not enabled > - > > Key: SPARK-18390 > URL: https://issues.apache.org/jira/browse/SPARK-18390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Xiangrui Meng > > I hit this error when I tried to test skewed joins. > {code} > val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) > val df3 = spark.range(1e9.toInt) > df3.join(df2, df3("id") === df2("one")).count() > {code} > throws > {code} > org.apache.spark.sql.AnalysisException: Cartesian joins could be > prohibitively expensive and are disabled by default. 
To explicitly enable > them, please set spark.sql.crossJoin.enabled = true; > {code} > This is probably not the right behavior because it was not the user who > suggested using cartesian product. SQL picked it while knowing it is not > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled
[ https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18390: -- Description: I hit this error when I tried to test skewed joins. {code} val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) val df3 = spark.range(1e9.toInt) df3.join(df2, df3("id") === df2("one")).count() {code} throws {noformat} org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true; {noformat} This is probably not the right behavior because it was not the user who suggested using cartesian product. SQL picked it while knowing it is not enabled. was: {code} val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) val df3 = spark.range(1e9.toInt) df3.join(df2, df3("id") === df2("one")).count() {code} throws {noformat} org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true; {noformat} This is probably not the right behavior because it was not the user who suggested using cartesian product. SQL picked it while knowing it is not enabled. > Optimized plan tried to use Cartesian join when it is not enabled > - > > Key: SPARK-18390 > URL: https://issues.apache.org/jira/browse/SPARK-18390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Xiangrui Meng > > I hit this error when I tried to test skewed joins. > {code} > val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) > val df3 = spark.range(1e9.toInt) > df3.join(df2, df3("id") === df2("one")).count() > {code} > throws > {noformat} > org.apache.spark.sql.AnalysisException: Cartesian joins could be > prohibitively expensive and are disabled by default. 
To explicitly enable > them, please set spark.sql.crossJoin.enabled = true; > {noformat} > This is probably not the right behavior because it was not the user who > suggested using cartesian product. SQL picked it while knowing it is not > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled
[ https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651859#comment-15651859 ] Xiangrui Meng commented on SPARK-18390: --- cc: [~yhuai] [~lian cheng] > Optimized plan tried to use Cartesian join when it is not enabled > - > > Key: SPARK-18390 > URL: https://issues.apache.org/jira/browse/SPARK-18390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Xiangrui Meng > > {code} > val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) > val df3 = spark.range(1e9.toInt) > df3.join(df2, df3("id") === df2("one")).count() > {code} > throws > {noformat} > org.apache.spark.sql.AnalysisException: Cartesian joins could be > prohibitively expensive and are disabled by default. To explicitly enable > them, please set spark.sql.crossJoin.enabled = true; > {noformat} > This is probably not the right behavior because it was not the user who > suggested using cartesian product. SQL picked it while knowing it is not > enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled
Xiangrui Meng created SPARK-18390: - Summary: Optimized plan tried to use Cartesian join when it is not enabled Key: SPARK-18390 URL: https://issues.apache.org/jira/browse/SPARK-18390 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.1 Reporter: Xiangrui Meng {code} val df2 = spark.range(1e9.toInt).withColumn("one", lit(1)) val df3 = spark.range(1e9.toInt) df3.join(df2, df3("id") === df2("one")).count() {code} throws {noformat} org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true; {noformat} This is probably not the right behavior because it was not the user who suggested using cartesian product. SQL picked it while knowing it is not enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-14241. --- Resolution: Fixed > Output of monotonically_increasing_id lacks stable relation with rows of > DataFrame > -- > > Key: SPARK-14241 > URL: https://issues.apache.org/jira/browse/SPARK-14241 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Paul Shearer > Fix For: 2.0.0 > > > If you use monotonically_increasing_id() to append a column of IDs to a > DataFrame, the IDs do not have a stable, deterministic relationship to the > rows they are appended to. A given ID value can land on different rows > depending on what happens in the task graph: > http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321 > From a user perspective this behavior is very unexpected, and many things one > would normally like to do with an ID column are in fact only possible under > very narrow circumstances. The function should either be made deterministic, > or there should be a prominent warning note in the API docs regarding its > behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630024#comment-15630024 ] Xiangrui Meng edited comment on SPARK-14241 at 11/2/16 7:05 PM: This bug should be fixed in 2.0 already in SPARK-13473 since we don't swap filter and nondeterministic expressions in plan optimization. was (Author: mengxr): This bug should be fixed in 2.0 already since we don't swap filter and nondeterministic expressions in plan optimization. > Output of monotonically_increasing_id lacks stable relation with rows of > DataFrame > -- > > Key: SPARK-14241 > URL: https://issues.apache.org/jira/browse/SPARK-14241 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Paul Shearer > Fix For: 2.0.0 > > > If you use monotonically_increasing_id() to append a column of IDs to a > DataFrame, the IDs do not have a stable, deterministic relationship to the > rows they are appended to. A given ID value can land on different rows > depending on what happens in the task graph: > http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321 > From a user perspective this behavior is very unexpected, and many things one > would normally like to do with an ID column are in fact only possible under > very narrow circumstances. The function should either be made deterministic, > or there should be a prominent warning note in the API docs regarding its > behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14241: -- Fix Version/s: 2.0.0 > Output of monotonically_increasing_id lacks stable relation with rows of > DataFrame > -- > > Key: SPARK-14241 > URL: https://issues.apache.org/jira/browse/SPARK-14241 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Paul Shearer > Fix For: 2.0.0 > > > If you use monotonically_increasing_id() to append a column of IDs to a > DataFrame, the IDs do not have a stable, deterministic relationship to the > rows they are appended to. A given ID value can land on different rows > depending on what happens in the task graph: > http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321 > From a user perspective this behavior is very unexpected, and many things one > would normally like to do with an ID column are in fact only possible under > very narrow circumstances. The function should either be made deterministic, > or there should be a prominent warning note in the API docs regarding its > behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630024#comment-15630024 ] Xiangrui Meng commented on SPARK-14241: --- This bug should be fixed in 2.0 already since we don't swap filter and nondeterministic expressions in plan optimization. > Output of monotonically_increasing_id lacks stable relation with rows of > DataFrame > -- > > Key: SPARK-14241 > URL: https://issues.apache.org/jira/browse/SPARK-14241 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Paul Shearer > > If you use monotonically_increasing_id() to append a column of IDs to a > DataFrame, the IDs do not have a stable, deterministic relationship to the > rows they are appended to. A given ID value can land on different rows > depending on what happens in the task graph: > http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321 > From a user perspective this behavior is very unexpected, and many things one > would normally like to do with an ID column are in fact only possible under > very narrow circumstances. The function should either be made deterministic, > or there should be a prominent warning note in the API docs regarding its > behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce
[ https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-14393: - Assignee: Xiangrui Meng > monotonicallyIncreasingId not monotonically increasing with downstream > coalesce > --- > > Key: SPARK-14393 > URL: https://issues.apache.org/jira/browse/SPARK-14393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0, 2.0.1 >Reporter: Jason Piper >Assignee: Xiangrui Meng > Labels: correctness > > When utilising monotonicallyIncreasingId with a coalesce, it appears that > every partition uses the same offset (0) leading to non-monotonically > increasing IDs. > See examples below > {code} > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show() > +---+ > |monotonicallyincreasingid()| > +---+ > |25769803776| > |51539607552| > |77309411328| > | 103079215104| > | 128849018880| > | 163208757248| > | 188978561024| > | 214748364800| > | 240518168576| > | 266287972352| > +---+ > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > +---+ > >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 1| > | 0| > | 0| > | 1| > | 2| > | 3| > | 0| > | 1| > | 2| > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce
[ https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14393: -- Labels: correctness (was: ) > monotonicallyIncreasingId not monotonically increasing with downstream > coalesce > --- > > Key: SPARK-14393 > URL: https://issues.apache.org/jira/browse/SPARK-14393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jason Piper > Labels: correctness > > When utilising monotonicallyIncreasingId with a coalesce, it appears that > every partition uses the same offset (0) leading to non-monotonically > increasing IDs. > See examples below > {code} > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show() > +---+ > |monotonicallyincreasingid()| > +---+ > |25769803776| > |51539607552| > |77309411328| > | 103079215104| > | 128849018880| > | 163208757248| > | 188978561024| > | 214748364800| > | 240518168576| > | 266287972352| > +---+ > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > +---+ > >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 1| > | 0| > | 0| > | 1| > | 2| > | 3| > | 0| > | 1| > | 2| > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce
[ https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15584761#comment-15584761 ] Xiangrui Meng edited comment on SPARK-14393 at 10/18/16 7:43 AM: - This is a bigger issue. It would happen with (`monotonically_increasing_id`, `rand`, `randn`, etc) x (`coalesce`, `union`, etc). The root cause is that the partition ID used to initialize the operator is not the partition ID associated with the DataFrame where the column was originally defined, which is expected by users. cc [~r...@databricks.com] [~yhuai] was (Author: mengxr): This is a bigger issue. It would happen with {`monotonically_increasing_id`, `rand`, `randn`, etc} x {`coalesce`, `union`, etc}. The root cause is that the partition ID used to initialize the operator is not the partition ID associated with the DataFrame where the column was originally defined, which is expected by users. cc [~r...@databricks.com] [~yhuai] > monotonicallyIncreasingId not monotonically increasing with downstream > coalesce > --- > > Key: SPARK-14393 > URL: https://issues.apache.org/jira/browse/SPARK-14393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jason Piper > > When utilising monotonicallyIncreasingId with a coalesce, it appears that > every partition uses the same offset (0) leading to non-monotonically > increasing IDs. 
> See examples below > {code} > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show() > +---+ > |monotonicallyincreasingid()| > +---+ > |25769803776| > |51539607552| > |77309411328| > | 103079215104| > | 128849018880| > | 163208757248| > | 188978561024| > | 214748364800| > | 240518168576| > | 266287972352| > +---+ > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > +---+ > >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 1| > | 0| > | 0| > | 1| > | 2| > | 3| > | 0| > | 1| > | 2| > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce
[ https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15584761#comment-15584761 ] Xiangrui Meng commented on SPARK-14393: --- This is a bigger issue. It would happen with {`monotonically_increasing_id`, `rand`, `randn`, etc} x {`coalesce`, `union`, etc}. The root cause is that the partition ID used to initialize the operator is not the partition ID associated with the DataFrame where the column was originally defined, which is expected by users. cc [~r...@databricks.com] [~yhuai] > monotonicallyIncreasingId not monotonically increasing with downstream > coalesce > --- > > Key: SPARK-14393 > URL: https://issues.apache.org/jira/browse/SPARK-14393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jason Piper > > When utilising monotonicallyIncreasingId with a coalesce, it appears that > every partition uses the same offset (0) leading to non-monotonically > increasing IDs. > See examples below > {code} > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show() > +---+ > |monotonicallyincreasingid()| > +---+ > |25769803776| > |51539607552| > |77309411328| > | 103079215104| > | 128849018880| > | 163208757248| > | 188978561024| > | 214748364800| > | 240518168576| > | 266287972352| > +---+ > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > +---+ > >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 1| > | 0| > | 0| > | 1| > | 2| > | 3| > | 0| > | 1| > | 2| > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
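The large gaps in the first output block above follow from how Spark documents the ID layout: the partition ID goes in the upper 31 bits and the record number within the partition in the lower 33 bits. A short Python sketch reproduces the values shown (the specific partition numbers depend on the local cluster, so treat them as illustrative):

```python
def monotonically_increasing_id(partition_id, record_number):
    # Documented layout: upper 31 bits = partition ID,
    # lower 33 bits = record number within that partition.
    assert 0 <= partition_id < 2**31 and 0 <= record_number < 2**33
    return (partition_id << 33) | record_number

# The first two values in the example correspond to row 0 of
# partitions 3 and 6:
assert monotonically_increasing_id(3, 0) == 25769803776
assert monotonically_increasing_id(6, 0) == 51539607552

# The bug: after coalesce(1), each task re-initializes the expression
# with the post-coalesce partition ID (0), so every row gets IDs that
# restart from 0 instead of the IDs of the partition that produced it.
```

This matches the root-cause analysis in the comment above: the partition ID used to initialize the operator is not the partition ID of the DataFrame where the column was originally defined.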
[jira] [Created] (SPARK-17716) Hidden Markov Model (HMM)
Xiangrui Meng created SPARK-17716: - Summary: Hidden Markov Model (HMM) Key: SPARK-17716 URL: https://issues.apache.org/jira/browse/SPARK-17716 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Runxin Li Had an offline chat with [~Lil'Rex], who implemented HMM on Spark at https://github.com/apache/spark/compare/master...lilrex:sequence. I asked him to list popular HMM applications, describe the public API (params, input/output schemas), and compare its API with existing HMM implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
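The ticket does not specify an API, but as background for the API-design discussion it requests: the computational core shared by most HMM libraries is the forward algorithm, which computes the likelihood of an observation sequence. A minimal plain-Python sketch (all model parameters below are made up for illustration):

```python
def forward(pi, A, B, obs):
    """Forward algorithm: P(observation sequence | model).

    pi[i]   - initial probability of state i
    A[i][j] - transition probability from state i to state j
    B[i][o] - probability of emitting observation o in state i
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# Two-state toy model with binary observations (illustrative values):
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
p = forward(pi, A, B, [0, 1, 0])
assert 0.0 < p < 1.0
# Sanity check: probabilities over all length-1 sequences sum to 1.
assert abs(forward(pi, A, B, [0]) + forward(pi, A, B, [1]) - 1.0) < 1e-9
```

A Spark implementation would additionally need to decide on DataFrame input/output schemas for sequences, which is exactly the design question the ticket raises.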
[jira] [Comment Edited] (SPARK-17647) SQL LIKE does not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523623#comment-15523623 ] Xiangrui Meng edited comment on SPARK-17647 at 9/26/16 5:07 PM: Thanks [~joshrosen]! I updated the JIRA description. The LIKE escaping behaviors in MySQL/PostgreSQL are documented here: * MySQL: http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html#operator_like * PostgreSQL: https://www.postgresql.org/docs/8.3/static/functions-matching.html In particular, MySQL: {noformat} Exception: At the end of the pattern string, backslash can be specified as “\\”. At the end of the string, backslash stands for itself because there is nothing following to escape. {noformat} That explains why MySQL returns true for both {code} '\\' like '' '\\' like '\\' {code} was (Author: mengxr): Thanks [~joshrosen]! I updated the JIRA description. The LIKE escaping behaviors in MySQL/PostgreSQL are documented here: * MySQL: http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html#operator_like * PostgreSQL: https://www.postgresql.org/docs/8.3/static/functions-matching.html In particular, MySQL: {noformat} Exception: At the end of the pattern string, backslash can be specified as “\\”. At the end of the string, backslash stands for itself because there is nothing following to escape. {noformat} That explains why MySQL returns true for both `\\` like `` and `\\` like `\\`. > SQL LIKE does not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > {code} > It returned false, which is wrong. 
> cc: [~yhuai] [~joshrosen] > A false-negative considered previously: > {code} > select '' rlike '.*.*'; > {code} > It returned true, which is correct if we assume that the pattern is treated > as a Java string but not raw string. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17647) SQL LIKE does not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523623#comment-15523623 ] Xiangrui Meng commented on SPARK-17647: --- Thanks [~joshrosen]! I updated the JIRA description. The LIKE escaping behaviors in MySQL/PostgreSQL are documented here: * MySQL: http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html#operator_like * PostgreSQL: https://www.postgresql.org/docs/8.3/static/functions-matching.html In particular, MySQL: {noformat} Exception: At the end of the pattern string, backslash can be specified as “\\”. At the end of the string, backslash stands for itself because there is nothing following to escape. Suppose that a table contains the following values: {noformat} That explains why MySQL returns true for both `\\` like `` and `\\` like `\\`. > SQL LIKE does not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > {code} > It returned false, which is wrong. > cc: [~yhuai] [~joshrosen] > A false-negative considered previously: > {code} > select '' rlike '.*.*'; > {code} > It returned true, which is correct if we assume that the pattern is treated > as a Java string but not raw string. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17647) SQL LIKE does not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523623#comment-15523623 ] Xiangrui Meng edited comment on SPARK-17647 at 9/26/16 5:06 PM: Thanks [~joshrosen]! I updated the JIRA description. The LIKE escaping behaviors in MySQL/PostgreSQL are documented here: * MySQL: http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html#operator_like * PostgreSQL: https://www.postgresql.org/docs/8.3/static/functions-matching.html In particular, MySQL: {noformat} Exception: At the end of the pattern string, backslash can be specified as “\\”. At the end of the string, backslash stands for itself because there is nothing following to escape. {noformat} That explains why MySQL returns true for both `\\` like `` and `\\` like `\\`. was (Author: mengxr): Thanks [~joshrosen]! I updated the JIRA description. The LIKE escaping behaviors in MySQL/PostgreSQL are documented here: * MySQL: http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html#operator_like * PostgreSQL: https://www.postgresql.org/docs/8.3/static/functions-matching.html In particular, MySQL: {noformat} Exception: At the end of the pattern string, backslash can be specified as “\\”. At the end of the string, backslash stands for itself because there is nothing following to escape. Suppose that a table contains the following values: {noformat} That explains why MySQL returns true for both `\\` like `` and `\\` like `\\`. > SQL LIKE does not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > {code} > It returned false, which is wrong. 
> cc: [~yhuai] [~joshrosen] > A false-negative considered previously: > {code} > select '' rlike '.*.*'; > {code} > It returned true, which is correct if we assume that the pattern is treated > as a Java string but not raw string.
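The escaping rules discussed in the thread above (ordinary characters match literally, `%` matches any sequence, `_` matches any single character, a backslash makes the following character literal, and a trailing backslash stands for itself per MySQL's documented exception) can be sketched in pure Python. This is an illustrative LIKE-to-regex translation only, not Spark's actual implementation; `like_to_regex` and `sql_like` are hypothetical helper names.

```python
import re

def like_to_regex(pattern, escape="\\"):
    # Translate a SQL LIKE pattern into a regex string:
    #   '%' -> '.*', '_' -> '.', escape char makes the next char literal.
    # A trailing escape char stands for itself (MySQL's documented exception).
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == escape and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))  # escaped char matches literally
            i += 2
        elif c == escape:
            out.append(re.escape(escape))  # trailing escape: literal backslash
            i += 1
        elif c == "%":
            out.append(".*")
            i += 1
        elif c == "_":
            out.append(".")
            i += 1
        else:
            out.append(re.escape(c))
            i += 1
    return "".join(out)

def sql_like(value, pattern):
    # fullmatch anchors the pattern at both ends, as SQL LIKE requires.
    return re.fullmatch(like_to_regex(pattern), value) is not None
```

A production implementation would additionally need to honor a user-specified ESCAPE clause and each engine's rules for invalid escape sequences, which is exactly where the MySQL/PostgreSQL behaviors linked above diverge.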
[jira] [Updated] (SPARK-17647) SQL LIKE do not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17647: -- Summary: SQL LIKE do not handle backslashes correctly (was: SQL LIKE/RLIKE do not handle backslashes correctly) > SQL LIKE do not handle backslashes correctly > > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > {code} > It returned false, which is wrong. > cc: [~yhuai] [~joshrosen] > A false-negative considered previously: > {code} > select '' rlike '.*.*'; > {code} > It returned true, which is correct if we assume that the pattern is treated > as a Java string but not raw string.
[jira] [Updated] (SPARK-17647) SQL LIKE/RLIKE do not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17647: -- Description: Try the following in SQL shell: {code} select '' like '%\\%'; {code} It returned false, which is wrong. cc: [~yhuai] [~joshrosen] A false-negative considered previously: {code} select '' rlike '.*.*'; {code} It returned true, which is correct if we assume that the pattern is treated as a Java string but not raw string. was: Try the following in SQL shell: {code} select '' like '%\\%'; {code} It returned false, which is wrong. cc: [~yhuai] [~joshrosen] A false-negative considered previously): {code} select '' rlike '.*.*'; {code} It returned true, which is correct if we assume that the pattern is treated as a Java string but not raw string. > SQL LIKE/RLIKE do not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > {code} > It returned false, which is wrong. > cc: [~yhuai] [~joshrosen] > A false-negative considered previously: > {code} > select '' rlike '.*.*'; > {code} > It returned true, which is correct if we assume that the pattern is treated > as a Java string but not raw string.
[jira] [Updated] (SPARK-17647) SQL LIKE does not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17647: -- Summary: SQL LIKE does not handle backslashes correctly (was: SQL LIKE do not handle backslashes correctly) > SQL LIKE does not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > {code} > It returned false, which is wrong. > cc: [~yhuai] [~joshrosen] > A false-negative considered previously: > {code} > select '' rlike '.*.*'; > {code} > It returned true, which is correct if we assume that the pattern is treated > as a Java string but not raw string.
[jira] [Updated] (SPARK-17647) SQL LIKE/RLIKE do not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17647: -- Description: Try the following in SQL shell: {code} select '' like '%\\%'; {code} It returned false, which is wrong. cc: [~yhuai] [~joshrosen] A false-negative considered previously): {code} select '' rlike '.*.*'; {code} It returned true, which is correct if we assume that the pattern is treated as a Java string but not raw string. was: Try the following in SQL shell: {code} select '' like '%\\%'; select '' rlike '.*.*'; {code} The first returned false and the second returned true. Both are wrong. cc: [~yhuai] [~joshrosen] > SQL LIKE/RLIKE do not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > {code} > It returned false, which is wrong. > cc: [~yhuai] [~joshrosen] > A false-negative considered previously): > {code} > select '' rlike '.*.*'; > {code} > It returned true, which is correct if we assume that the pattern is treated > as a Java string but not raw string.
[jira] [Updated] (SPARK-17647) SQL LIKE/RLIKE do not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17647: -- Labels: correctness (was: ) > SQL LIKE/RLIKE do not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > select '' rlike '.*.*'; > {code} > The first returned false and the second returned true. Both are wrong. > cc: [~yhuai]
[jira] [Updated] (SPARK-17647) SQL LIKE/RLIKE do not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17647: -- Description: Try the following in SQL shell: {code} select '' like '%\\%'; select '' rlike '.*.*'; {code} The first returned false and the second returned true. Both are wrong. cc: [~yhuai] [~joshrosen] was: Try the following in SQL shell: {code} select '' like '%\\%'; select '' rlike '.*.*'; {code} The first returned false and the second returned true. Both are wrong. cc: [~yhuai] > SQL LIKE/RLIKE do not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > Labels: correctness > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > select '' rlike '.*.*'; > {code} > The first returned false and the second returned true. Both are wrong. > cc: [~yhuai] [~joshrosen]
[jira] [Created] (SPARK-17647) SQL LIKE/RLIKE do not handle backslashes correctly
Xiangrui Meng created SPARK-17647: - Summary: SQL LIKE/RLIKE do not handle backslashes correctly Key: SPARK-17647 URL: https://issues.apache.org/jira/browse/SPARK-17647 Project: Spark Issue Type: Bug Components: SQL Reporter: Xiangrui Meng Try the following in SQL shell: {code} select '' like '%\\%'; select '' rlike '.*.*'; {code} The first returned false and the second returned true. Both are wrong.
[jira] [Updated] (SPARK-17647) SQL LIKE/RLIKE do not handle backslashes correctly
[ https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17647: -- Description: Try the following in SQL shell: {code} select '' like '%\\%'; select '' rlike '.*.*'; {code} The first returned false and the second returned true. Both are wrong. cc: [~yhuai] was: Try the following in SQL shell: {code} select '' like '%\\%'; select '' rlike '.*.*'; {code} The first returned false and the second returned true. Both are wrong. > SQL LIKE/RLIKE do not handle backslashes correctly > -- > > Key: SPARK-17647 > URL: https://issues.apache.org/jira/browse/SPARK-17647 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xiangrui Meng > > Try the following in SQL shell: > {code} > select '' like '%\\%'; > select '' rlike '.*.*'; > {code} > The first returned false and the second returned true. Both are wrong. > cc: [~yhuai]
[jira] [Updated] (SPARK-17641) collect_set should ignore null values
[ https://issues.apache.org/jira/browse/SPARK-17641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17641: -- Description: `collect_set` throws the following exception when there are null values. It should ignore null values to be consistent with other aggregation methods. {code} select collect_set(null) from (select 1) tmp; java.lang.IllegalArgumentException: Flat hash tables cannot contain null elements. at scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390) at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41) at scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136) at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41) at org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29) at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232) at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225) {code} cc: [~yhuai] was: `collect_set` throws the following exception when there are null values. It should ignore null values to be consistent with other aggregation methods. {code} java.lang.IllegalArgumentException: Flat hash tables cannot contain null elements. at scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390) at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41) at scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136) at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41) at org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29) at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232) at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225) {code} cc: [~yhuai] > collect_set should ignore null values > 
- > > Key: SPARK-17641 > URL: https://issues.apache.org/jira/browse/SPARK-17641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > `collect_set` throws the following exception when there are null values. It > should ignore null values to be consistent with other aggregation methods. > {code} > select collect_set(null) from (select 1) tmp; > java.lang.IllegalArgumentException: Flat hash tables cannot contain null > elements. > at > scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHas
[jira] [Updated] (SPARK-17641) collect_set should ignore null values
[ https://issues.apache.org/jira/browse/SPARK-17641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-17641: -- Description: `collect_set` throws the following exception when there are null values. It should ignore null values to be consistent with other aggregation methods. {code} java.lang.IllegalArgumentException: Flat hash tables cannot contain null elements. at scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390) at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41) at scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136) at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41) at org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29) at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232) at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225) {code} cc: [~yhuai] was: `collect_set` throws the following exception when there are null values. It should ignore null values to be consistent with other aggregation methods. {code} java.lang.IllegalArgumentException: Flat hash tables cannot contain null elements. at scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390) at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41) at scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136) at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41) at org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29) at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232) at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225) {code} > collect_set should ignore null values > - > > Key: 
SPARK-17641 > URL: https://issues.apache.org/jira/browse/SPARK-17641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > `collect_set` throws the following exception when there are null values. It > should ignore null values to be consistent with other aggregation methods. > {code} > java.lang.IllegalArgumentException: Flat hash tables cannot contain null > elements. > at > scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390) > at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41) > at > sc
[jira] [Created] (SPARK-17641) collect_set should ignore null values
Xiangrui Meng created SPARK-17641: - Summary: collect_set should ignore null values Key: SPARK-17641 URL: https://issues.apache.org/jira/browse/SPARK-17641 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xiangrui Meng `collect_set` throws the following exception when there are null values. It should ignore null values to be consistent with other aggregation methods. {code} java.lang.IllegalArgumentException: Flat hash tables cannot contain null elements. at scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390) at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41) at scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136) at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41) at org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29) at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232) at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225) {code}
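The fix proposed in SPARK-17641 — skipping nulls rather than inserting them into the underlying hash set — can be illustrated with a small pure-Python aggregation. This is a behavioral sketch only, not Spark's Catalyst implementation; `collect_set` here is a stand-in for the SQL aggregate, with `None` playing the role of SQL NULL.

```python
def collect_set(values):
    # Collect the distinct non-null values of an input sequence.
    # Nulls (None) are ignored instead of being added, so the aggregate
    # never hits the "Flat hash tables cannot contain null elements"
    # failure shown in the stack trace above, and stays consistent with
    # other aggregates (e.g. count, sum), which skip nulls.
    result = set()
    for v in values:
        if v is not None:
            result.add(v)
    return result
```

Under this behavior, `select collect_set(null) from (select 1) tmp` would return an empty set instead of raising an exception.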
[jira] [Comment Edited] (SPARK-16578) Configurable hostname for RBackend
[ https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429942#comment-15429942 ] Xiangrui Meng edited comment on SPARK-16578 at 8/22/16 12:47 AM: - [~shivaram] I had an offline discussion with [~junyangq] and I feel that we might have some misunderstanding of user scenarios. The old workflow for SparkR is the following: 1. Users download and install Spark distribution by themselves. 2. Users let R know where to find the SparkR package on local. 3. `library(SparkR)` 4. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. And the ideal workflow is the following: 1. install.packages("SparkR") from CRAN and then `library(SparkR)` 2. optionally `install.spark` 3. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. So the way we run spark-submit, RBackend, and R process, and create the SparkContext doesn't really change. They are still running on the same machine (e.g., user's laptop). So it is not necessary to make RBackend running remotely for this scenario. Having RBackend running remotely is a new Spark deployment mode and I think it requires more design and discussions. was (Author: mengxr): [~shivaram] I had an offline discussion with [~junyangq] and I feel that we might have some misunderstanding of user scenarios. The old workflow for SparkR is the following: 1. Users download and install Spark distribution by themselves. 2. Users let R know where to find the SparkR package on local. 3. `library(SparkR)` 4. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. And the ideal workflow is the following: 1. install.packages("SparkR") from CRAN 2. optionally `install.spark` 3. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. So the way we run spark-submit, RBackend, and R process, and create the SparkContext doesn't really change. 
They are still running on the same machine (e.g., user's laptop). So it is not necessary to make RBackend running remotely for this scenario. Having RBackend running remotely is a new Spark deployment mode and I think it requires more design and discussions. > Configurable hostname for RBackend > -- > > Key: SPARK-16578 > URL: https://issues.apache.org/jira/browse/SPARK-16578 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Junyang Qian > > One of the requirements that comes up with SparkR being a standalone package > is that users can now install just the R package on the client side and > connect to a remote machine which runs the RBackend class. > We should check if we can support this mode of execution and what are the > pros / cons of it
[jira] [Comment Edited] (SPARK-16578) Configurable hostname for RBackend
[ https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429942#comment-15429942 ] Xiangrui Meng edited comment on SPARK-16578 at 8/22/16 12:46 AM: - [~shivaram] I had an offline discussion with [~junyangq] and I feel that we might have some misunderstanding of user scenarios. The old workflow for SparkR is the following: 1. Users download and install Spark distribution by themselves. 2. Users let R know where to find the SparkR package on local. 3. `library(SparkR)` 4. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. And the ideal workflow is the following: 1. install.packages("SparkR") from CRAN 2. optionally `install.spark` 3. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. So the way we run spark-submit, RBackend, and R process, and create the SparkContext doesn't really change. They are still running on the same machine (e.g., user's laptop). So it is not necessary to make RBackend running remotely for this scenario. Having RBackend running remotely is a new Spark deployment mode and I think it requires more design and discussions. was (Author: mengxr): [~shivaram] I had an offline discussion with [~junyangq] and I feel that we might have some misunderstanding of user scenarios. The old workflow for SparkR is the following: 1. Users download and install Spark distribution by themselves. 2. Users let R know where to find the SparkR package on local. 3. `library(SparkR)` 4. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. And the ideal workflow is the following: 1. install.packages("SparkR") 2. optionally `install.spark` 3. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. So the way we run spark-submit, RBackend, and R process, and create the SparkContext doesn't really change. 
They are still running on the same machine (e.g., user's laptop). So it is not necessary to make RBackend running remotely for this scenario. Having RBackend running remotely is a new Spark deployment mode and I think it requires more design and discussions. > Configurable hostname for RBackend > -- > > Key: SPARK-16578 > URL: https://issues.apache.org/jira/browse/SPARK-16578 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Junyang Qian > > One of the requirements that comes up with SparkR being a standalone package > is that users can now install just the R package on the client side and > connect to a remote machine which runs the RBackend class. > We should check if we can support this mode of execution and what are the > pros / cons of it
[jira] [Commented] (SPARK-16578) Configurable hostname for RBackend
[ https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429942#comment-15429942 ] Xiangrui Meng commented on SPARK-16578: --- [~shivaram] I had an offline discussion with [~junyangq] and I feel that we might have some misunderstanding of user scenarios. The old workflow for SparkR is the following: 1. Users download and install Spark distribution by themselves. 2. Users let R know where to find the SparkR package on local. 3. `library(SparkR)` 4. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. And the ideal workflow is the following: 1. install.packages("SparkR") 2. optionally `install.spark` 3. Launch driver/SparkContext (in client mode) and connect to a local or remote cluster. So the way we run spark-submit, RBackend, and R process, and create the SparkContext doesn't really change. They are still running on the same machine (e.g., user's laptop). So it is not necessary to make RBackend running remotely for this scenario. Having RBackend running remotely is a new Spark deployment mode and I think it requires more design and discussions. > Configurable hostname for RBackend > -- > > Key: SPARK-16578 > URL: https://issues.apache.org/jira/browse/SPARK-16578 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Junyang Qian > > One of the requirements that comes up with SparkR being a standalone package > is that users can now install just the R package on the client side and > connect to a remote machine which runs the RBackend class. > We should check if we can support this mode of execution and what are the > pros / cons of it
[jira] [Updated] (SPARK-16578) Configurable hostname for RBackend
[ https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16578: -- Assignee: Junyang Qian > Configurable hostname for RBackend > -- > > Key: SPARK-16578 > URL: https://issues.apache.org/jira/browse/SPARK-16578 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Junyang Qian > > One of the requirements that comes up with SparkR being a standalone package > is that users can now install just the R package on the client side and > connect to a remote machine which runs the RBackend class. > We should check if we can support this mode of execution and what are the > pros / cons of it
[jira] [Resolved] (SPARK-16443) ALS wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16443. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14384 [https://github.com/apache/spark/pull/14384] > ALS wrapper in SparkR > - > > Key: SPARK-16443 > URL: https://issues.apache.org/jira/browse/SPARK-16443 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Junyang Qian > Fix For: 2.1.0 > > > Wrap MLlib's ALS in SparkR. We should discuss whether we want to support R > formula or not for ALS.
[jira] [Resolved] (SPARK-16446) Gaussian Mixture Model wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16446. --- Resolution: Fixed Fix Version/s: 2.1.0 > Gaussian Mixture Model wrapper in SparkR > > > Key: SPARK-16446 > URL: https://issues.apache.org/jira/browse/SPARK-16446 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > Fix For: 2.1.0 > > > Follow instructions in SPARK-16442 and implement Gaussian Mixture Model > wrapper in SparkR.
[jira] [Commented] (SPARK-16445) Multilayer Perceptron Classifier wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394642#comment-15394642 ] Xiangrui Meng commented on SPARK-16445: --- [~iamshrek] Any updates? > Multilayer Perceptron Classifier wrapper in SparkR > -- > > Key: SPARK-16445 > URL: https://issues.apache.org/jira/browse/SPARK-16445 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Xin Ren > > Follow instructions in SPARK-16442 and implement multilayer perceptron > classifier wrapper in SparkR.
[jira] [Commented] (SPARK-16446) Gaussian Mixture Model wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394639#comment-15394639 ] Xiangrui Meng commented on SPARK-16446: --- [~yanboliang] Any updates? > Gaussian Mixture Model wrapper in SparkR > > > Key: SPARK-16446 > URL: https://issues.apache.org/jira/browse/SPARK-16446 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > Follow instructions in SPARK-16442 and implement Gaussian Mixture Model > wrapper in SparkR.
[jira] [Updated] (SPARK-16444) Isotonic Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16444: -- Shepherd: Junyang Qian > Isotonic Regression wrapper in SparkR > - > > Key: SPARK-16444 > URL: https://issues.apache.org/jira/browse/SPARK-16444 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Miao Wang > > Implement Isotonic Regression wrapper and other utils in SparkR. > {code} > spark.isotonicRegression(data, formula, ...) > {code}
[jira] [Updated] (SPARK-16579) Add a spark install function
[ https://issues.apache.org/jira/browse/SPARK-16579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16579: -- Assignee: Junyang Qian > Add a spark install function > > > Key: SPARK-16579 > URL: https://issues.apache.org/jira/browse/SPARK-16579 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Junyang Qian > > As described in the design doc, we need to introduce a function to install > Spark in case the user directly downloads SparkR from CRAN. > To do that we can introduce an install_spark function that takes the > following arguments > {code} > hadoop_version > url_to_use # defaults to apache > local_dir # defaults to a cache dir > {code} > Furthermore, I think we can run this automatically from sparkR.init if we > find the Spark home and JARs missing.
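[Editor's note] The install function proposed above might be sketched roughly as follows in R. The function and argument names (install_spark, hadoop_version, url_to_use, local_dir) come from the issue description; the body, the default values, and the archive URL layout are assumptions for illustration, not Spark's actual implementation.

```r
# Hypothetical sketch only; defaults and URL layout are assumed, not the final API.
install_spark <- function(hadoop_version = "2.7",
                          url_to_use = "https://archive.apache.org/dist/spark",
                          local_dir = file.path(path.expand("~"), ".cache", "spark")) {
  version <- "2.0.0"  # assumed Spark version for illustration
  pkg <- sprintf("spark-%s-bin-hadoop%s", version, hadoop_version)
  spark_home <- file.path(local_dir, pkg)
  if (!dir.exists(spark_home)) {
    dir.create(local_dir, recursive = TRUE, showWarnings = FALSE)
    tarball <- file.path(local_dir, paste0(pkg, ".tgz"))
    # Download the binary distribution and unpack it into the cache dir
    download.file(sprintf("%s/spark-%s/%s.tgz", url_to_use, version, pkg), tarball)
    untar(tarball, exdir = local_dir)
  }
  spark_home  # return SPARK_HOME so sparkR.init could pick it up
}
```

Returning the unpacked directory would let sparkR.init call this automatically when it cannot find a Spark home, as the description suggests.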
[jira] [Updated] (SPARK-16538) Cannot use "SparkR::sql"
[ https://issues.apache.org/jira/browse/SPARK-16538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16538: -- Fix Version/s: 1.6.3 > Cannot use "SparkR::sql" > > > Key: SPARK-16538 > URL: https://issues.apache.org/jira/browse/SPARK-16538 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.2, 2.0.0 >Reporter: Weiluo Ren >Assignee: Felix Cheung >Priority: Critical > Fix For: 1.6.3, 2.0.0 > > > When calling "SparkR::sql", an error pops up. For instance: > {code} > SparkR::sql("") > Error in get(paste0(funcName, ".default")) : > object '::.default' not found > {code} > https://github.com/apache/spark/blob/f4767bcc7a9d1bdd301f054776aa45e7c9f344a7/R/pkg/R/SQLContext.R#L51
[jira] [Updated] (SPARK-16447) LDA wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16447: -- Assignee: Xusen Yin > LDA wrapper in SparkR > - > > Key: SPARK-16447 > URL: https://issues.apache.org/jira/browse/SPARK-16447 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > Follow instructions in SPARK-16442 and implement LDA wrapper in SparkR.
[jira] [Commented] (SPARK-16447) LDA wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371384#comment-15371384 ] Xiangrui Meng commented on SPARK-16447: --- Assigned. Thanks! > LDA wrapper in SparkR > - > > Key: SPARK-16447 > URL: https://issues.apache.org/jira/browse/SPARK-16447 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > Follow instructions in SPARK-16442 and implement LDA wrapper in SparkR.
[jira] [Commented] (SPARK-16445) Multilayer Perceptron Classifier wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371382#comment-15371382 ] Xiangrui Meng commented on SPARK-16445: --- The target version is 2.1.0. So no strict deadline but thanks for asking! > Multilayer Perceptron Classifier wrapper in SparkR > -- > > Key: SPARK-16445 > URL: https://issues.apache.org/jira/browse/SPARK-16445 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Xin Ren > > Follow instructions in SPARK-16442 and implement multilayer perceptron > classifier wrapper in SparkR.
[jira] [Updated] (SPARK-16445) Multilayer Perceptron Classifier wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16445: -- Assignee: Xin Ren > Multilayer Perceptron Classifier wrapper in SparkR > -- > > Key: SPARK-16445 > URL: https://issues.apache.org/jira/browse/SPARK-16445 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Xin Ren > > Follow instructions in SPARK-16442 and implement multilayer perceptron > classifier wrapper in SparkR.
[jira] [Updated] (SPARK-16444) Isotonic Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16444: -- Assignee: Miao Wang > Isotonic Regression wrapper in SparkR > - > > Key: SPARK-16444 > URL: https://issues.apache.org/jira/browse/SPARK-16444 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Miao Wang > > Implement Isotonic Regression wrapper and other utils in SparkR. > {code} > spark.isotonicRegression(data, formula, ...) > {code}
[jira] [Commented] (SPARK-15767) Decision Tree Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368547#comment-15368547 ] Xiangrui Meng commented on SPARK-15767: --- This was discussed in SPARK-14831. We should call it `spark.algo(data, formula, method, required params, [optional params])` and use the same param names as in MLlib. But I'm not sure what method name to use here. We should think about method names for all tree methods together. cc [~josephkb] > Decision Tree Regression wrapper in SparkR > -- > > Key: SPARK-15767 > URL: https://issues.apache.org/jira/browse/SPARK-15767 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Kai Jiang >Assignee: Kai Jiang > > Implement a wrapper in SparkR to support decision tree regression. R's native > Decision Tree Regression implementation comes from the rpart package, with the > signature rpart(formula, dataframe, method="anova"). I propose we implement an > API like spark.rpart(dataframe, formula, ...). After implementing decision > tree classification, we could refactor these two into an API more like rpart()
[jira] [Commented] (SPARK-16446) Gaussian Mixture Model wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15367937#comment-15367937 ] Xiangrui Meng commented on SPARK-16446: --- [~yanboliang] Do you have time to work on this? > Gaussian Mixture Model wrapper in SparkR > > > Key: SPARK-16446 > URL: https://issues.apache.org/jira/browse/SPARK-16446 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng > > Follow instructions in SPARK-16442 and implement Gaussian Mixture Model > wrapper in SparkR.
[jira] [Created] (SPARK-16446) Gaussian Mixture Model wrapper in SparkR
Xiangrui Meng created SPARK-16446: - Summary: Gaussian Mixture Model wrapper in SparkR Key: SPARK-16446 URL: https://issues.apache.org/jira/browse/SPARK-16446 Project: Spark Issue Type: Sub-task Reporter: Xiangrui Meng Follow instructions in SPARK-16442 and implement Gaussian Mixture Model wrapper in SparkR.
[jira] [Created] (SPARK-16447) LDA wrapper in SparkR
Xiangrui Meng created SPARK-16447: - Summary: LDA wrapper in SparkR Key: SPARK-16447 URL: https://issues.apache.org/jira/browse/SPARK-16447 Project: Spark Issue Type: Sub-task Reporter: Xiangrui Meng Follow instructions in SPARK-16442 and implement LDA wrapper in SparkR.
[jira] [Updated] (SPARK-15767) Decision Tree Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15767: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-16442 > Decision Tree Regression wrapper in SparkR > -- > > Key: SPARK-15767 > URL: https://issues.apache.org/jira/browse/SPARK-15767 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Kai Jiang >Assignee: Kai Jiang > > Implement a wrapper in SparkR to support decision tree regression. R's native > Decision Tree Regression implementation comes from the rpart package, with the > signature rpart(formula, dataframe, method="anova"). I propose we implement an > API like spark.rpart(dataframe, formula, ...). After implementing decision > tree classification, we could refactor these two into an API more like rpart()