[jira] [Commented] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow

2020-01-25 Thread Hossein Falaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023715#comment-17023715
 ] 

Hossein Falaki commented on SPARK-30629:


Yes, this is a good example. There must be a solution that avoids the old bug and 
does not fail on such a case, but I could not find one.
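One possible direction, sketched below purely as an illustration ({{clean_closure_sketch}} is a hypothetical stand-in, not SparkR's actual {{cleanClosure()}}): keep a set of functions that have already been visited while traversing a closure, so a recursive reference is skipped instead of being followed indefinitely.

{code:r}
# Hypothetical illustration only (not SparkR's cleanClosure implementation):
# walk a function body and look up the functions it references, keeping a
# "seen" set so a recursive definition is skipped instead of re-traversed.
clean_closure_sketch <- function(func, seen = new.env(parent = emptyenv())) {
  func_id <- paste(deparse(func), collapse = "\n")
  if (exists(func_id, envir = seen, inherits = FALSE)) {
    return(invisible(NULL))  # already visited: stop instead of recursing forever
  }
  assign(func_id, TRUE, envir = seen)

  walk <- function(node) {
    if (is.name(node)) {
      obj <- tryCatch(get(as.character(node), envir = environment(func)),
                      error = function(e) NULL)
      if (is.function(obj)) clean_closure_sketch(obj, seen)
    } else if (is.call(node) || is.pairlist(node)) {
      lapply(as.list(node), walk)
    }
    invisible(NULL)
  }
  walk(body(func))
}

f <- function(x) f(x)
clean_closure_sketch(f)  # returns instead of overflowing the node stack
{code}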

> cleanClosure on recursive call leads to node stack overflow
> ---
>
> Key: SPARK-30629
> URL: https://issues.apache.org/jira/browse/SPARK-30629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> This problem surfaced while handling SPARK-22817. In theory there are tests 
> that cover this problem, but it seems they have been dead for some reason.
> Reproducible example:
> {code:r}
> f <- function(x) {
>   f(x)
> }
> SparkR:::cleanClosure(f)
> {code}






[jira] [Commented] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow

2020-01-25 Thread Hossein Falaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023712#comment-17023712
 ] 

Hossein Falaki commented on SPARK-30629:


Oh yes. I think we can disable that specific test. If it ever ran and passed, it 
must have been because of the other bug. Thanks for catching it.

> cleanClosure on recursive call leads to node stack overflow
> ---
>
> Key: SPARK-30629
> URL: https://issues.apache.org/jira/browse/SPARK-30629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> This problem surfaced while handling SPARK-22817. In theory there are tests 
> that cover this problem, but it seems they have been dead for some reason.
> Reproducible example:
> {code:r}
> f <- function(x) {
>   f(x)
> }
> SparkR:::cleanClosure(f)
> {code}
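For context, a minimal sketch of the kind of regression test being discussed, in testthat syntax and assuming SparkR is installed; the actual test in the SparkR suite may be named and worded differently. With the current behaviour this expectation fails with a node stack overflow, which is the reason for disabling it.

{code:r}
# Hedged sketch of the kind of test under discussion (testthat syntax); the
# actual SparkR test may differ in name and wording.
library(testthat)

test_that("cleanClosure handles a recursive function", {
  f <- function(x) f(x)
  # NA as the second argument means "expect no error"
  expect_error(SparkR:::cleanClosure(f), NA)
})
{code}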






[jira] [Commented] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow

2020-01-25 Thread Hossein Falaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023710#comment-17023710
 ] 

Hossein Falaki commented on SPARK-30629:


[~zero323] that infinite recursive call will never terminate, so it is better 
to fail during {{cleanClosure()}} rather than during execution on the workers.

> cleanClosure on recursive call leads to node stack overflow
> ---
>
> Key: SPARK-30629
> URL: https://issues.apache.org/jira/browse/SPARK-30629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> This problem surfaced while handling SPARK-22817. In theory there are tests 
> that cover this problem, but it seems they have been dead for some reason.
> Reproducible example:
> {code:r}
> f <- function(x) {
>   f(x)
> }
> SparkR:::cleanClosure(f)
> {code}






[jira] [Updated] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function

2019-11-06 Thread Hossein Falaki (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-29777:
---
Description: 
The following code block reproduces the issue:
{code}
df <- createDataFrame(data.frame(x=1))
f1 <- function(x) x + 1
f2 <- function(x) f1(x) + 2

dapplyCollect(df, function(x) { f1(x); f2(x) })
{code}

We get the following error message:

{code}
org.apache.spark.SparkException: R computation failed with
 Error in f1(x) : could not find function "f1"
Calls: compute -> computeFunc -> f2
{code}

Compare that to this code block, which succeeds:
{code}
dapplyCollect(df, function(x) { f2(x) })
{code}

  was:
The following code block reproduces the issue:
{code:java}
library(SparkR)
sparkR.session()
spark_df <- createDataFrame(na.omit(airquality))

cody_local2 <- function(param2) {
 10 + param2
}
cody_local1 <- function(param1) {
 cody_local2(param1)
}

result <- cody_local2(5)
calc_df <- dapplyCollect(spark_df, function(x) {
 cody_local2(20)
 cody_local1(5)
})
print(result)
{code}
We get the following error message:
{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 
(TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R 
computation failed with
 Error in cody_local2(param1) : could not find function "cody_local2"
Calls: compute -> computeFunc -> cody_local1
{code}
Compare that to this code block that succeeds:
{code:java}
calc_df <- dapplyCollect(spark_df, function(x) {
 cody_local2(20)
 #cody_local1(5)
})
{code}


> SparkR::cleanClosure aggressively removes a function required by user function
> --
>
> Key: SPARK-29777
> URL: https://issues.apache.org/jira/browse/SPARK-29777
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.4
>Reporter: Hossein Falaki
>Priority: Major
>
> The following code block reproduces the issue:
> {code}
> df <- createDataFrame(data.frame(x=1))
> f1 <- function(x) x + 1
> f2 <- function(x) f1(x) + 2
> dapplyCollect(df, function(x) { f1(x); f2(x) })
> {code}
> We get the following error message:
> {code}
> org.apache.spark.SparkException: R computation failed with
>  Error in f1(x) : could not find function "f1"
> Calls: compute -> computeFunc -> f2
> {code}
> Compare that to this code block, which succeeds:
> {code}
> dapplyCollect(df, function(x) { f2(x) })
> {code}
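A workaround sketch, assuming a running SparkR session (this is only an illustration of how users can avoid relying on closure capture, not a description of how {{cleanClosure}} decides what to keep): define the helper functions inside the function passed to {{dapplyCollect}}, so nothing has to be captured from the driver-side environment.

{code:r}
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(x = 1))

# Define the helpers inside the shipped function so the workers do not depend
# on cleanClosure() capturing f1 from the driver's environment.
dapplyCollect(df, function(part) {
  f1 <- function(x) x + 1
  f2 <- function(x) f1(x) + 2
  f1(part)
  f2(part)
})
{code}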






[jira] [Updated] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function

2019-11-06 Thread Hossein Falaki (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-29777:
---
Description: 
The following code block reproduces the issue:
{code:java}
library(SparkR)
sparkR.session()
spark_df <- createDataFrame(na.omit(airquality))

cody_local2 <- function(param2) {
 10 + param2
}
cody_local1 <- function(param1) {
 cody_local2(param1)
}

result <- cody_local2(5)
calc_df <- dapplyCollect(spark_df, function(x) {
 cody_local2(20)
 cody_local1(5)
})
print(result)
{code}
We get the following error message:
{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 
(TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R 
computation failed with
 Error in cody_local2(param1) : could not find function "cody_local2"
Calls: compute -> computeFunc -> cody_local1
{code}
Compare that to this code block that succeeds:
{code:java}
calc_df <- dapplyCollect(spark_df, function(x) {
 cody_local2(20)
 #cody_local1(5)
})
{code}

  was:
The following code block reproduces the issue:
{code:java}
library(SparkR)
sparkR.session()
spark_df <- createDataFrame(na.omit(airquality))

cody_local2 <- function(param2) {
 10 + param2
}
cody_local1 <- function(param1) {
 cody_local2(param1)
}

result <- cody_local2(5)
calc_df <- dapplyCollect(spark_df, function(x) {
 cody_local2(20)
 cody_local1(5)
})
print(result)
{code}
We get the following error message:
{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 
(TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R 
computation failed with
 Error in cody_local2(param1) : could not find function "cody_local2"
Calls: compute -> computeFunc -> cody_local1
{code}
Compare that to this code block, which succeeds:
{code:java}
calc_df <- dapplyCollect(spark_df, function(x) {
 cody_local2(20)
 #cody_local1(5)
})
{code}


> SparkR::cleanClosure aggressively removes a function required by user function
> --
>
> Key: SPARK-29777
> URL: https://issues.apache.org/jira/browse/SPARK-29777
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.4
>Reporter: Hossein Falaki
>Priority: Major
>
> The following code block reproduces the issue:
> {code:java}
> library(SparkR)
> sparkR.session()
> spark_df <- createDataFrame(na.omit(airquality))
> cody_local2 <- function(param2) {
>  10 + param2
> }
> cody_local1 <- function(param1) {
>  cody_local2(param1)
> }
> result <- cody_local2(5)
> calc_df <- dapplyCollect(spark_df, function(x) {
>  cody_local2(20)
>  cody_local1(5)
> })
> print(result)
> {code}
> We get the following error message:
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 
> (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R 
> computation failed with
>  Error in cody_local2(param1) : could not find function "cody_local2"
> Calls: compute -> computeFunc -> cody_local1
> {code}
> Compare that to this code block that succeeds:
> {code:java}
> calc_df <- dapplyCollect(spark_df, function(x) {
>  cody_local2(20)
>  #cody_local1(5)
> })
> {code}






[jira] [Updated] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function

2019-11-06 Thread Hossein Falaki (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-29777:
---
Description: 
The following code block reproduces the issue:
{code:java}
library(SparkR)
sparkR.session()
spark_df <- createDataFrame(na.omit(airquality))

cody_local2 <- function(param2) {
 10 + param2
}
cody_local1 <- function(param1) {
 cody_local2(param1)
}

result <- cody_local2(5)
calc_df <- dapplyCollect(spark_df, function(x) {
 cody_local2(20)
 cody_local1(5)
})
print(result)
{code}
We get the following error message:
{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 
(TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R 
computation failed with
 Error in cody_local2(param1) : could not find function "cody_local2"
Calls: compute -> computeFunc -> cody_local1
{code}
Compare that to this code block, which succeeds:
{code:java}
calc_df <- dapplyCollect(spark_df, function(x) {
 cody_local2(20)
 #cody_local1(5)
})
{code}

  was:
The following code block reproduces the issue:
{code}
library(SparkR)
spark_df <- createDataFrame(na.omit(airquality))

cody_local2 <- function(param2) {
 10 + param2
}
cody_local1 <- function(param1) {
 cody_local2(param1)
}

result <- cody_local2(5)
calc_df <- dapplyCollect(spark_df, function(x) {
 cody_local2(20)
 cody_local1(5)
})
print(result)
{code}
We get the following error message:
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 
(TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R 
computation failed with
 Error in cody_local2(param1) : could not find function "cody_local2"
Calls: compute -> computeFunc -> cody_local1
{code}

Compare that to this code block, which succeeds:
{code}
calc_df <- dapplyCollect(spark_df, function(x) {
 cody_local2(20)
 #cody_local1(5)
})
{code}


> SparkR::cleanClosure aggressively removes a function required by user function
> --
>
> Key: SPARK-29777
> URL: https://issues.apache.org/jira/browse/SPARK-29777
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.4
>Reporter: Hossein Falaki
>Priority: Major
>
> The following code block reproduces the issue:
> {code:java}
> library(SparkR)
> sparkR.session()
> spark_df <- createDataFrame(na.omit(airquality))
> cody_local2 <- function(param2) {
>  10 + param2
> }
> cody_local1 <- function(param1) {
>  cody_local2(param1)
> }
> result <- cody_local2(5)
> calc_df <- dapplyCollect(spark_df, function(x) {
>  cody_local2(20)
>  cody_local1(5)
> })
> print(result)
> {code}
> We get the following error message:
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 
> (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R 
> computation failed with
>  Error in cody_local2(param1) : could not find function "cody_local2"
> Calls: compute -> computeFunc -> cody_local1
> {code}
> Compare that to this code block, which succeeds:
> {code:java}
> calc_df <- dapplyCollect(spark_df, function(x) {
>  cody_local2(20)
>  #cody_local1(5)
> })
> {code}






[jira] [Created] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function

2019-11-06 Thread Hossein Falaki (Jira)
Hossein Falaki created SPARK-29777:
--

 Summary: SparkR::cleanClosure aggressively removes a function 
required by user function
 Key: SPARK-29777
 URL: https://issues.apache.org/jira/browse/SPARK-29777
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.4.4
Reporter: Hossein Falaki


The following code block reproduces the issue:
{code}
library(SparkR)
spark_df <- createDataFrame(na.omit(airquality))

cody_local2 <- function(param2) {
 10 + param2
}
cody_local1 <- function(param1) {
 cody_local2(param1)
}

result <- cody_local2(5)
calc_df <- dapplyCollect(spark_df, function(x) {
 cody_local2(20)
 cody_local1(5)
})
print(result)
{code}
We get the following error message:
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 
(TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R 
computation failed with
 Error in cody_local2(param1) : could not find function "cody_local2"
Calls: compute -> computeFunc -> cody_local1
{code}

Compare that to this code block, which succeeds:
{code}
calc_df <- dapplyCollect(spark_df, function(x) {
 cody_local2(20)
 #cody_local1(5)
})
{code}






[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-14 Thread Hossein Falaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513008#comment-16513008
 ] 

Hossein Falaki commented on SPARK-24359:


[~shivaram] I like that.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, as in the example above, the syntax translates 
> nicely to a natural pipeline style with help from the very popular [magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> 
> > lr{code}
> h2. Namespace
> All new API will be under a new CRAN package, named SparkML. The package 
> should be usable without needing SparkR in the namespace. The package will 
> introduce a number of S4 classes that inherit from four basic classes. Here 
> we will list the basic types with a 

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-14 Thread Hossein Falaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512959#comment-16512959
 ] 

Hossein Falaki commented on SPARK-24359:


Considering that I am volunteering myself to do the housekeeping needed for any 
SparkML maintenance branches, I conclude that we are going to keep this as part 
of the main repository. I expect that we will submit to CRAN only when the 
community feels comfortable about the stability of the new package (following an 
alpha => beta => GA process).

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, as in the example above, the syntax translates 
> nicely to a natural pipeline style with help from the very popular [magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% 

[jira] [Comment Edited] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-04 Thread Hossein Falaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501254#comment-16501254
 ] 

Hossein Falaki edited comment on SPARK-24359 at 6/5/18 3:51 AM:


[~shivaram] what prevents us from creating a tag like SparkML-2.4.0.1 and 
SparkML-2.4.0.2 (or some other variant like that) in the main Spark repo? Also, 
if you think initially this will be unclear, we don't have to submit SparkML to 
CRAN in its first release. Similar to SparkR, we can wait a bit until we are 
confident about its compatibility. Many users and distributions distribute 
SparkR from Apache rather than CRAN (one example is Databricks; we build SparkR 
from source), and they would be able to give us feedback while the project is in 
an alpha state.


was (Author: falaki):
[~shivaram] what prevents us from creating a tag like SparkML-2.4.0.1 and 
SparkML-2.4.0.2 (or some other variant like that) in the main Spark repo? Also, 
if you think initially this will be unclear, we don't have to submit SparkML to 
CRAN in its first release. Similar to SparkR, we can wait a bit until we are 
confident about its compatibility. Many users and distributions distribute 
SparkR from Apache rather than CRAN. One example is Databricks. We build SparkR 
from source.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they 

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-04 Thread Hossein Falaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501254#comment-16501254
 ] 

Hossein Falaki commented on SPARK-24359:


[~shivaram] what prevents us from creating a tag like SparkML-2.4.0.1 and 
SparkML-2.4.0.2 (or some other variant like that) in the main Spark repo? Also, 
if you think initially this will be unclear, we don't have to submit SparkML to 
CRAN in its first release. Similar to SparkR, we can wait a bit until we are 
confident about its compatibility. Many users and distributions distribute 
SparkR from Apache rather than CRAN. One example is Databricks. We build SparkR 
from source.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, as in the example above, the syntax translates 
> nicely to a natural pipeline style with help from the very popular [magrittr 
> 

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-01 Thread Hossein Falaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498842#comment-16498842
 ] 

Hossein Falaki commented on SPARK-24359:


Yes. My bad, I meant releasing an update to CRAN for every 2.x and 3.x release. 
However, if Spark does patch releases like 2.3.4, we are not required to push a 
new CRAN package, though we have the option to. I guess that is identical to the 
SparkR CRAN release cycle.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, as in the example above, the syntax translates 
> nicely to a natural pipeline style with help from the very popular [magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> 
> > lr{code}
> h2. Namespace
> All new API will be under a 

[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-26 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-24359:
---
Description: 
h1. Background and motivation

SparkR supports calling MLlib functionality with an [R-friendly 
API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
 Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
 has matured significantly. It allows users to build and maintain complicated 
machine learning pipelines. A lot of this functionality is difficult to expose 
using the simple formula-based API in SparkR.

We propose a new R package, _SparkML_, to be distributed along with SparkR as 
part of Apache Spark. This new package will be built on top of SparkR’s APIs to 
expose SparkML’s pipeline APIs and functionality.

*Why not SparkR?*

SparkR package contains ~300 functions. Many of these shadow functions in base 
and other popular CRAN packages. We think adding more functions to SparkR will 
degrade usability and make maintenance harder.

*Why not sparklyr?*

sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
they are not comprehensive. Also we propose a code-gen approach for this 
package to minimize work needed to expose future MLlib API, but sparklyr’s API 
is manually written.
h1. Target Personas
 * Existing SparkR users who need more flexible SparkML API
 * R users (data scientists, statisticians) who wish to build Spark ML 
pipelines in R

h1. Goals
 * R users can install SparkML from CRAN
 * R users will be able to import SparkML independent from SparkR
 * After setting up a Spark session R users can
 ** create a pipeline by chaining individual components and specifying their 
parameters
 ** tune a pipeline in parallel, taking advantage of Spark
 ** inspect a pipeline’s parameters and evaluation metrics
 ** repeatedly apply a pipeline
 * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
Transformers

h1. Non-Goals
 * Adding new algorithms to SparkML R package which do not exist in Scala
 * Parallelizing existing CRAN packages
 * Changing existing SparkR ML wrapping API

h1. Proposed API Changes
h2. Design goals

When encountering trade-offs in the API, we will choose based on the following list 
of priorities. The API choice that addresses a higher priority goal will be 
chosen.
 # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
of future ML algorithms difficult will be ruled out.

 * *Semantic clarity*: We attempt to minimize confusion with other packages. 
Between conciseness and clarity, we will choose clarity.

 * *Maintainability and testability:* API choices that require manual 
maintenance or make testing difficult should be avoided.

 * *Interoperability with rest of Spark components:* We will keep the R API as 
thin as possible and keep all functionality implementation in JVM/Scala.

 * *Being natural to R users:* Ultimate users of this package are R users and 
they should find it easy and natural to use.

The API will follow familiar R function syntax, where the object is passed as 
the first argument of the method:  do_something(obj, arg1, arg2). All functions 
are snake_case (e.g., {{spark_logistic_regression()}} and {{set_max_iter()}}). 
If a constructor gets arguments, they will be named arguments. For example:
{code:java}
> lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
When calls need to be chained, as in the example above, the syntax translates nicely 
to a natural pipeline style with help from the very popular [magrittr 
package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
example:
{code:java}
> logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code}
h2. Namespace

All new API will be under a new CRAN package, named SparkML. The package should 
be usable without needing SparkR in the namespace. The package will introduce a 
number of S4 classes that inherit from four basic classes. Here we will list 
the basic types with a few examples. An object of any child class can be 
instantiated with a function call that starts with {{spark_}}.
h2. Pipeline & PipelineStage

A pipeline object contains one or more stages.  
{code:java}
> pipeline <- spark_pipeline() %>% set_stages(stage1, stage2, stage3){code}
Where stage1, stage2, etc are S4 objects of a PipelineStage and pipeline is an 
object of type Pipeline.
h2. Transformers

A Transformer is an algorithm that can transform one SparkDataFrame into 
another SparkDataFrame.

*Example API:*
{code:java}
> tokenizer <- spark_tokenizer() %>%
            set_input_col("text") %>%
            set_output_col("words")

> tokenized.df <- tokenizer %>% transform(df) {code}
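Putting the proposed pieces together, a hedged end-to-end sketch in the proposed API; none of these functions exist yet, and {{spark_hashing_tf()}} and {{fit()}} in particular are assumed names following the snake_case convention above rather than names taken from this document:

{code:r}
# Sketch of the proposed SparkML API only -- nothing here exists yet; names
# follow the snake_case convention described above, and spark_hashing_tf()
# and fit() are assumed for illustration.
library(SparkML)

tokenizer <- spark_tokenizer() %>%
  set_input_col("text") %>%
  set_output_col("words")

hashing_tf <- spark_hashing_tf() %>%
  set_input_col("words") %>%
  set_output_col("features")

lr <- spark_logistic_regression() %>%
  set_max_iter(10) %>%
  set_reg_param(0.01)

pipeline <- spark_pipeline() %>% set_stages(tokenizer, hashing_tf, lr)

model <- pipeline %>% fit(training)       # `training` is a SparkDataFrame
predictions <- model %>% transform(test)  # `test` is a SparkDataFrame
{code}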
h2. 

[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-26 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-24359:
---
Attachment: SparkML_ ML Pipelines in R-v3.pdf

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, as in the example above, the syntax translates 
> nicely to a natural pipeline style with help from the very popular [magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> 
> > lr{code}
> h2. Namespace
> All new API will be under a new CRAN package, named SparkML. The package 
> should be usable without needing SparkR in the namespace. The package will 
> introduce a number of S4 classes that inherit from four basic classes. Here 
> we will list the basic types with a few examples. An 

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-26 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491907#comment-16491907
 ] 

Hossein Falaki commented on SPARK-24359:


Thank you [~josephkb] and [~felixcheung] for reviewing.

As for a separate repo, since this is going to be just a new directory, I think 
we will contribute it to the Apache Spark repository, exactly for the reasons you 
mention. Where the code sits does not impact CRAN release management.

As for the CRAN release, I think it is reasonable to release SparkML to CRAN with 
every major Spark release. Initially we will not release the package for minor 
releases and expect SparkML 2.4 to work with any 2.4.x release of Spark. To 
minimize the CRAN check burden, we will run all integration tests (those that 
interact with the JVM can call {{SparkR.callJMethod()}}) on Spark Jenkins machines. 
BTW: I expect these tests will be minimal because we can unit-test the code 
generation logic to make sure it generates correct R code.

[~felixcheung]:
 # {{set_input_col}} and {{set_output_col}} will accept any type that MLlib 
uses for {{setInputCol}} and {{setOutputCol}}. In this case we find that 
UnaryTransformer functions take column names which are Strings. 
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L86]
 # There are two possibilities for reading javadoc. I did not include more 
details in the design document because it seems an implementation detail.

 ## Calling {{javadoc}} and then reading generated HTML files.
 ## Compiling the documents into the jar (using annotations) and then reading 
them in the code generation tool.
 # Done. 
 # Yes, {{training}} is a SparkDataFrame S4 object (which has been imported 
from {{SparkR}}) – the new package depends on {{SparkR}} and will be imported 
after SparkR.

I updated the document and uploaded a new version.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal 

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-24 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489944#comment-16489944
 ] 

Hossein Falaki commented on SPARK-24359:


Thank you guys for the feedback. I updated the SPIP and the design document to use 
snake_case everywhere. I also added a section to the design document summarizing 
the CRAN release strategy. We can write integration tests that run on Jenkins to 
detect when we need to re-publish SparkML to CRAN. CRAN tests will not include 
any integration tests that interact with the JVM.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s API 
> is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, like the above example, the syntax translates 
> naturally to a pipeline style with help from the very popular [magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> 

[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-24 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-24359:
---
Description: 
h1. Background and motivation

SparkR supports calling MLlib functionality with an [R-friendly 
API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
 Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
 has matured significantly. It allows users to build and maintain complicated 
machine learning pipelines. A lot of this functionality is difficult to expose 
using the simple formula-based API in SparkR.

We propose a new R package, _SparkML_, to be distributed along with SparkR as 
part of Apache Spark. This new package will be built on top of SparkR’s APIs to 
expose SparkML’s pipeline APIs and functionality.

*Why not SparkR?*

SparkR package contains ~300 functions. Many of these shadow functions in base 
and other popular CRAN packages. We think adding more functions to SparkR will 
degrade usability and make maintenance harder.

*Why not sparklyr?*

sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
they are not comprehensive. Also we propose a code-gen approach for this 
package to minimize work needed to expose future MLlib API, but sparklyr’s API 
is manually written.
h1. Target Personas
 * Existing SparkR users who need more flexible SparkML API
 * R users (data scientists, statisticians) who wish to build Spark ML 
pipelines in R

h1. Goals
 * R users can install SparkML from CRAN
 * R users will be able to import SparkML independent from SparkR
 * After setting up a Spark session R users can
 ** create a pipeline by chaining individual components and specifying their 
parameters
 ** tune a pipeline in parallel, taking advantage of Spark
 ** inspect a pipeline’s parameters and evaluation metrics
 ** repeatedly apply a pipeline
 * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
Transformers

h1. Non-Goals
 * Adding new algorithms to SparkML R package which do not exist in Scala
 * Parallelizing existing CRAN packages
 * Changing existing SparkR ML wrapping API

h1. Proposed API Changes
h2. Design goals

When encountering trade-offs in API, we will choose based on the following list 
of priorities. The API choice that addresses a higher priority goal will be 
chosen.
 # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
of future ML algorithms difficult will be ruled out.

 * *Semantic clarity*: We attempt to minimize confusion with other packages. 
Between conciseness and clarity, we will choose clarity.

 * *Maintainability and testability:* API choices that require manual 
maintenance or make testing difficult should be avoided.

 * *Interoperability with rest of Spark components:* We will keep the R API as 
thin as possible and keep all functionality implementation in JVM/Scala.

 * *Being natural to R users:* Ultimate users of this package are R users and 
they should find it easy and natural to use.

The API will follow familiar R function syntax, where the object is passed as 
the first argument of the method:  do_something(obj, arg1, arg2). All functions 
are snake_case (e.g., {{spark_logistic_regression()}} and {{set_max_iter()}}). 
If a constructor gets arguments, they will be named arguments. For example:
{code:java}
> lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
When calls need to be chained, like the above example, the syntax translates 
naturally to a pipeline style with help from the very popular [magrittr 
package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
example:
{code:java}
> logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code}
h2. Namespace

All new API will be under a new CRAN package, named SparkML. The package should 
be usable without needing SparkR in the namespace. The package will introduce a 
number of S4 classes that inherit from four basic classes. Here we will list 
the basic types with a few examples. An object of any child class can be 
instantiated with a function call that starts with {{spark_}}.
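
As a minimal sketch of that layering (the {{jobj}} slot holding the JVM-side 
reference and the construction path are assumptions of this illustration, not 
something the SPIP fixes):
{code:r}
# Illustrative S4 skeleton; slot names are assumptions of this sketch.
setClass("PipelineStage", slots = c(jobj = "ANY"))
setClass("Transformer", contains = "PipelineStage")
setClass("Tokenizer", contains = "Transformer")

# Constructor following the spark_ prefix convention described above.
spark_tokenizer <- function() {
  new("Tokenizer",
      jobj = SparkR::sparkR.newJObject("org.apache.spark.ml.feature.Tokenizer"))
}
{code}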
h2. Pipeline & PipelineStage

A pipeline object contains one or more stages.  
{code:java}
> pipeline <- spark_pipeline() %>% set_stages(stage1, stage2, stage3){code}
Where stage1, stage2, etc. are S4 objects of type PipelineStage and pipeline is an 
object of type Pipeline.
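
To illustrate how such a pipeline would then be trained and reused, a hedged 
sketch follows; verbs such as {{fit()}} and the {{PipelineModel}} return type 
mirror the Scala API and are assumptions here, not settled SparkML names.
{code:r}
> model <- pipeline %>% fit(training_df)      # assumed fit() verb; returns a PipelineModel
> scored_df <- model %>% transform(test_df)   # the fitted model can be applied repeatedly
{code}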
h2. Transformers

A Transformer is an algorithm that can transform one SparkDataFrame into 
another SparkDataFrame.

*Example API:*
{code:java}
> tokenizer <- spark_tokenizer() %>%
    set_input_col("text") %>%
    set_output_col("words")
> tokenized.df <- tokenizer %>% transform(df){code}
h2. 

[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-24 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-24359:
---
Attachment: SparkML_ ML Pipelines in R-v2.pdf

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s API 
> is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, like the above example, the syntax translates 
> naturally to a pipeline style with help from the very popular [magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code}
> h2. Namespace
> All new API will be under a new CRAN package, named SparkML. The package 
> should be usable without needing SparkR in the namespace. The package will 
> introduce a number of S4 classes that inherit from four basic classes. Here 
> we will list the basic types with a few examples. An object of any child 
> class can be 

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-23 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487815#comment-16487815
 ] 

Hossein Falaki commented on SPARK-24359:


Thanks [~shivaram] and [~zero323]. It seems CRAN release pain has left some 
scar tissue. We can host this new package in a separate repo and maintain it 
for a few release cycles to evaluate CRAN release overhead. If the overhead is 
not too high, we can contribute it back to the main Spark repository. 
Alternatively, we can remove the requirement for co-releasing with Apache Spark 
and only release when there are API changes in the new package. 

As for duplicate API, I see the issue as well. I think there is room for both 
styles (formula-based for simple use cases and pipeline-based for more complex 
programs). Based on feedback from the community we can decide if deprecating the 
old API makes sense down the road. I have received many requests from SparkR users 
for the ability to build pipelines (the same way Python and Scala support them).

As for whether this new package will introduce Scala API changes, in my current 
prototype they are very minimal (and can be avoided). Almost all new Scala code is 
for the utility that generates R source code. The idea is that if a patch adds new 
API to MLlib, the contributor can simply execute a command-line tool and check in 
the R wrappers for the new API. The goal of this work is to reduce the maintenance 
cost of the R API in Spark.
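
For illustration only, the checked-in output of such a generator could look 
roughly like the wrapper below for a single MLlib setter; the {{jobj}} slot and 
the exact shape of the generated code are assumptions of this sketch, not the 
prototype's actual output.
{code:r}
# Hypothetical generated wrapper for LogisticRegression.setMaxIter.
set_max_iter <- function(stage, value) {
  # Delegate to the underlying Java object through SparkR's JVM bridge.
  SparkR::sparkR.callJMethod(stage@jobj, "setMaxIter", as.integer(value))
  stage
}
{code}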

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s API 
> is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural 

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-23 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486799#comment-16486799
 ] 

Hossein Falaki commented on SPARK-24359:


Thanks for reviewing [~felixcheung].
 # I wanted all discussions under this ticket rather than a Google document. 
 # The plan is to release a new SparkML R package with every new Apache Spark 
(and SparkR) release. I expect all new MLlib API to be exposed in the new 
package.
 # SparkML R package will depend on SparkR and will use 
{{SparkR::sparkR.callJStatic()}} and {{SparkR::sparkR.callJMethod()}} for 
calling JVM functions. The package will import the {{SparkR::SparkDataFrame}} 
class from SparkR (a minimal sketch of such a bridge-based wrapper follows this 
list). 
 # My proposed API style is {{spark.xyz()}} for object construction and 
{{set_xyz() / get_xyz()}} for setters and getters. If you think this will be 
confusing to users, I will update the design doc to stick to {{_}}. We should 
not have camel case in any API. S4 object names can match Spark class names 
(e.g., {{LogisticRegression}}). These are not exposed to users.
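
A minimal sketch of a wrapper built on the JVM bridge mentioned in point 3 above; 
the S4 wrapper class and its {{jobj}} slot are illustrative assumptions, while 
{{sparkR.newJObject()}}, {{sparkR.callJStatic()}} and {{sparkR.callJMethod()}} are 
existing SparkR exports.
{code:r}
# Illustrative constructor built on SparkR's public JVM bridge.
spark_logistic_regression <- function() {
  jobj <- SparkR::sparkR.newJObject("org.apache.spark.ml.classification.LogisticRegression")
  new("LogisticRegression", jobj = jobj)  # S4 wrapper class assumed to be defined by SparkML
}

# Illustrative getter that reads a parameter back from the Java side.
get_max_iter <- function(stage) {
  SparkR::sparkR.callJMethod(stage@jobj, "getMaxIter")
}
{code}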

Regarding the effort required for submitting and maintaining a package on CRAN, 
I am hoping to minimize the tests that interact with the JVM:
 * Unit testing code generation in Spark
 * Relying on {{SparkR::sparkR.callJStatic()}} and 
{{SparkR::sparkR.callJMethod()}} and assuming that they are unit-tested in 
SparkR.

Thanks for linking the original ticket.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s API 
> is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this 

[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-22 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-24359:
---
Description: 
h1. Background and motivation

SparkR supports calling MLlib functionality with an [R-friendly 
API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
 Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
 has matured significantly. It allows users to build and maintain complicated 
machine learning pipelines. A lot of this functionality is difficult to expose 
using the simple formula-based API in SparkR.

We propose a new R package, _SparkML_, to be distributed along with SparkR as 
part of Apache Spark. This new package will be built on top of SparkR’s APIs to 
expose SparkML’s pipeline APIs and functionality.

*Why not SparkR?*

SparkR package contains ~300 functions. Many of these shadow functions in base 
and other popular CRAN packages. We think adding more functions to SparkR will 
degrade usability and make maintenance harder.

*Why not sparklyr?*

sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
they are not comprehensive. Also we propose a code-gen approach for this 
package to minimize work needed to expose future MLlib API, but sparklyr’s API 
is manually written.
h1. Target Personas
 * Existing SparkR users who need more flexible SparkML API
 * R users (data scientists, statisticians) who wish to build Spark ML 
pipelines in R

h1. Goals
 * R users can install SparkML from CRAN
 * R users will be able to import SparkML independent from SparkR
 * After setting up a Spark session R users can
 ** create a pipeline by chaining individual components and specifying their 
parameters
 ** tune a pipeline in parallel, taking advantage of Spark
 ** inspect a pipeline’s parameters and evaluation metrics
 ** repeatedly apply a pipeline
 * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
Transformers

h1. Non-Goals
 * Adding new algorithms to SparkML R package which do not exist in Scala
 * Parallelizing existing CRAN packages
 * Changing existing SparkR ML wrapping API

h1. Proposed API Changes
h2. Design goals

When encountering trade-offs in API, we will choose based on the following list 
of priorities. The API choice that addresses a higher priority goal will be 
chosen.
 # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
of future ML algorithms difficult will be ruled out.

 * *Semantic clarity*: We attempt to minimize confusion with other packages. 
Between conciseness and clarity, we will choose clarity.

 * *Maintainability and testability:* API choices that require manual 
maintenance or make testing difficult should be avoided.

 * *Interoperability with rest of Spark components:* We will keep the R API as 
thin as possible and keep all functionality implementation in JVM/Scala.

 * *Being natural to R users:* Ultimate users of this package are R users and 
they should find it easy and natural to use.

The API will follow familiar R function syntax, where the object is passed as 
the first argument of the method:  do_something(obj, arg1, arg2). All 
constructors are dot separated (e.g., spark.logistic.regression()) and all 
setters and getters are snake case (e.g., set_max_iter()). If a constructor 
gets arguments, they will be named arguments. For example:
{code:java}
> lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1){code}
When calls need to be chained, like the above example, the syntax translates 
naturally to a pipeline style with help from the very popular [magrittr 
package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
example:
{code:java}
> logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code}
h2. Namespace

All new API will be under a new CRAN package, named SparkML. The package should 
be usable without needing SparkR in the namespace. The package will introduce a 
number of S4 classes that inherit from four basic classes. Here we will list 
the basic types with a few examples. An object of any child class can be 
instantiated with a function call that starts with spark.
h2. Pipeline & PipelineStage

A pipeline object contains one or more stages.  
{code:java}
> pipeline <- spark.pipeline() %>% set_stages(stage1, stage2, stage3){code}
Where stage1, stage2, etc are S4 objects of a PipelineStage and pipeline is an 
object of type Pipeline.
h2. Transformers

A Transformer is an algorithm that can transform one SparkDataFrame into 
another SparkDataFrame.

*Example API:*
{code:java}
> tokenizer <- spark.tokenizer() %>%
    set_input_col("text") %>%
    set_output_col("words")
> tokenized.df <- 

[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-22 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-24359:
---
Description: 
h1. Background and motivation

SparkR supports calling MLlib functionality with an [R-friendly 
API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
 Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
 has matured significantly. It allows users to build and maintain complicated 
machine learning pipelines. A lot of this functionality is difficult to expose 
using the simple formula-based API in SparkR.

We propose a new R package, _SparkML_, to be distributed along with SparkR as 
part of Apache Spark. This new package will be built on top of SparkR’s APIs to 
expose SparkML’s pipeline APIs and functionality.

*Why not SparkR?*

SparkR package contains ~300 functions. Many of these shadow functions in base 
and other popular CRAN packages. We think adding more functions to SparkR will 
degrade usability and make maintenance harder.

*Why not sparklyr?*

sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
they are not comprehensive. Also we propose a code-gen approach for this 
package to minimize work needed to expose future MLlib API, but sparklyr’s API 
is manually written.
h1. Target Personas
 * Existing SparkR users who need more flexible SparkML API
 * R users (data scientists, statisticians) who wish to build Spark ML 
pipelines in R

h1. Goals
 * R users can install SparkML from CRAN
 * R users will be able to import SparkML independent from SparkR
 * After setting up a Spark session R users can
 ** create a pipeline by chaining individual components and specifying their 
parameters
 ** tune a pipeline in parallel, taking advantage of Spark
 ** inspect a pipeline’s parameters and evaluation metrics
 ** repeatedly apply a pipeline

 * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
Transformers

h1. Non-Goals
 * Adding new algorithms to SparkML R package which do not exist in Scala
 * Parallelizing existing CRAN packages
 * Changing existing SparkR ML wrapping API

h1. Proposed API Changes
h2. Design goals

When encountering trade-offs in API, we will choose based on the following list 
of priorities. The API choice that addresses a higher priority goal will be 
chosen.
 # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
of future ML algorithms difficult will be ruled out.

 * *Semantic clarity*: We attempt to minimize confusion with other packages. 
Between conciseness and clarity, we will choose clarity.

 * *Maintainability and testability:* API choices that require manual 
maintenance or make testing difficult should be avoided.

 * *Interoperability with rest of Spark components:* We will keep the R API as 
thin as possible and keep all functionality implementation in JVM/Scala.

 * *Being natural to R users:* Ultimate users of this package are R users and 
they should find it easy and natural to use.

The API will follow familiar R function syntax, where the object is passed as 
the first argument of the method:  do_something(obj, arg1, arg2). All 
constructors are dot separated (e.g., spark.logistic.regression()) and all 
setters and getters are snake case (e.g., set_max_iter()). If a constructor 
gets arguments, they will be named arguments. For example:
{code:java}
> lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1){code}
When calls need to be chained, like the above example, the syntax translates 
naturally to a pipeline style with help from the very popular [magrittr 
package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
example:
{code:java}
> logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code}
h2. Namespace

All new API will be under a new CRAN package, named SparkML. The package should 
be usable without needing SparkR in the namespace. The package will introduce a 
number of S4 classes that inherit from four basic classes. Here we will list 
the basic types with a few examples. An object of any child class can be 
instantiated with a function call that starts with spark.
h2. Pipeline & PipelineStage

A pipeline object contains one or more stages.  
{code:java}
> pipeline <- spark.pipeline() %>% set_stages(stage1, stage2, stage3){code}
Where stage1, stage2, etc are S4 objects of a PipelineStage and pipeline is an 
object of type Pipeline.
h2. Transformers

A Transformer is an algorithm that can transform one SparkDataFrame into 
another SparkDataFrame.

*Example API:*
{code:java}
> tokenizer <- spark.tokenizer() %>%
    set_input_col("text") %>%
    set_output_col("words")
> tokenized.df <- 

[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-22 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-24359:
---
Description: 
h1. Background and motivation

SparkR supports calling MLlib functionality with an [R-friendly 
API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
 Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
 has matured significantly. It allows users to build and maintain complicated 
machine learning pipelines. A lot of this functionality is difficult to expose 
using the simple formula-based API in SparkR.

We propose a new R package, _SparkML_, to be distributed along with SparkR as 
part of Apache Spark. This new package will be built on top of SparkR’s APIs to 
expose SparkML’s pipeline APIs and functionality.

*Why not SparkR?*

SparkR package contains ~300 functions. Many of these shadow functions in base 
and other popular CRAN packages. We think adding more functions to SparkR will 
degrade usability and make maintenance harder.

*Why not sparklyr?*

sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
they are not comprehensive. Also we propose a code-gen approach for this 
package to minimize work needed to expose future MLlib API, but sparklyr’s API 
is manually written.
h1. Target Personas
 * Existing SparkR users who need more flexible SparkML API
 * R users (data scientists, statisticians) who wish to build Spark ML 
pipelines in R

h1. Goals
 * R users can install SparkML from CRAN
 * R users will be able to import SparkML independent from SparkR
 * After setting up a Spark session R users can
 ** create a pipeline by chaining individual components and specifying their 
parameters
 ** tune a pipeline in parallel, taking advantage of Spark
 ** inspect a pipeline’s parameters and evaluation metrics
 ** repeatedly apply a pipeline

 * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
Transformers

h1. Non-Goals
 * Adding new algorithms to SparkML R package which do not exist in Scala
 * Parallelizing existing CRAN packages
 * Changing existing SparkR ML wrapping API

h1. Proposed API Changes
h2. Design goals

When encountering trade-offs in API, we will choose based on the following list 
of priorities. The API choice that addresses a higher priority goal will be 
chosen.
 # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
of future ML algorithms difficult will be ruled out.

 * *Semantic clarity*: We attempt to minimize confusion with other packages. 
Between conciseness and clarity, we will choose clarity.

 * *Maintainability and testability:* API choices that require manual 
maintenance or make testing difficult should be avoided.

 * *Interoperability with rest of Spark components:* We will keep the R API as 
thin as possible and keep all functionality implementation in JVM/Scala.

 * *Being natural to R users:* Ultimate users of this package are R users and 
they should find it easy and natural to use.

The API will follow familiar R function syntax, where the object is passed as 
the first argument of the method:  do_something(obj, arg1, arg2). All 
constructors are dot separated (e.g., spark.logistic.regression()) and all 
setters and getters are snake case (e.g., set_max_iter()). If a constructor 
gets arguments, they will be named arguments. For example:
{code:java}
> lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1){code}
When calls need to be chained, like the above example, the syntax translates 
naturally to a pipeline style with help from the very popular [magrittr 
package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
example:
{code:java}
> logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code}
h2. Namespace

All new API will be under a new CRAN package, named SparkML. The package should 
be usable without needing SparkR in the namespace. The package will introduce a 
number of S4 classes that inherit from four basic classes. Here we will list 
the basic types with a few examples. An object of any child class can be 
instantiated with a function call that starts with spark.
h2. Pipeline & PipelineStage

A pipeline object contains one or more stages.  
{code:java}
> pipeline <- spark.pipeline() %>% set_stages(stage1, stage2, stage3){code}
Where stage1, stage2, etc are S4 objects of a PipelineStage and pipeline is an 
object of type Pipeline.
h2. Transformers

A Transformer is an algorithm that can transform one SparkDataFrame into 
another SparkDataFrame.

*Example API:*
{code:java}
> tokenizer <- spark.tokenizer() %>%
    set_input_col("text") %>%
    set_output_col("words")
> tokenized.df <- 

[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-22 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-24359:
---
Description: 
h1. Background and motivation

SparkR supports calling MLlib functionality with an [R-friendly 
API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
 Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
 has matured significantly. It allows users to build and maintain complicated 
machine learning pipelines. A lot of this functionality is difficult to expose 
using the simple formula-based API in SparkR.

We propose a new R package, _SparkML_, to be distributed along with SparkR as 
part of Apache Spark. This new package will be built on top of SparkR’s APIs to 
expose SparkML’s pipeline APIs and functionality.

*Why not SparkR?*

SparkR package contains ~300 functions. Many of these shadow functions in base 
and other popular CRAN packages. We think adding more functions to SparkR will 
degrade usability and make maintenance harder.

*Why not sparklyr?*

sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
they are not comprehensive. Also we propose a code-gen approach for this 
package to minimize work needed to expose future MLlib API, but sparklyr’s API 
is manually written.
h1. Target Personas
 * Existing SparkR users who need more flexible SparkML API
 * R users (data scientists, statisticians) who wish to build Spark ML 
pipelines in R

h1. Goals
 * R users can install SparkML from CRAN
 * R users will be able to import SparkML independent from SparkR
 * After setting up a Spark session R users can
 ** create a pipeline by chaining individual components and specifying their 
parameters
 ** tune a pipeline in parallel, taking advantage of Spark
 ** inspect a pipeline’s parameters and evaluation metrics
 ** repeatedly apply a pipeline

 * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
Transformers

h1. Non-Goals
 * Adding new algorithms to SparkML R package which do not exist in Scala
 * Parallelizing existing CRAN packages
 * Changing existing SparkR ML wrapping API

h1. Proposed API Changes
h2. Design goals

When encountering trade-offs in API, we will choose based on the following list 
of priorities. The API choice that addresses a higher priority goal will be 
chosen.
 # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
of future ML algorithms difficult will be ruled out.

 * *Semantic clarity*: We attempt to minimize confusion with other packages. 
Between conciseness and clarity, we will choose clarity.

 * *Maintainability and testability:* API choices that require manual 
maintenance or make testing difficult should be avoided.

 * *Interoperability with rest of Spark components:* We will keep the R API as 
thin as possible and keep all functionality implementation in JVM/Scala.

 * *Being natural to R users:* Ultimate users of this package are R users and 
they should find it easy and natural to use.

The API will follow familiar R function syntax, where the object is passed as 
the first argument of the method:  do_something(obj, arg1, arg2). All 
constructors are dot separated (e.g., spark.logistic.regression()) and all 
setters and getters are snake case (e.g., set_max_iter()). If a constructor 
gets arguments, they will be named arguments. For example:
{code:java}
> lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1){code}
When calls need to be chained, like the above example, the syntax translates 
naturally to a pipeline style with help from the very popular [magrittr 
package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
example:
{code:java}
> logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code}
h2. Namespace

All new API will be under a new CRAN package, named SparkML. The package should 
be usable without needing SparkR in the namespace. The package will introduce a 
number of S4 classes that inherit from four basic classes. Here we will list 
the basic types with a few examples. An object of any child class can be 
instantiated with a function call that starts with spark.
h2. Pipeline & PipelineStage

A pipeline object contains one or more stages.  
{code:java}
> pipeline <- spark.pipeline() %>% set_stages(stage1, stage2, stage3){code}
Where stage1, stage2, etc are S4 objects of a PipelineStage and pipeline is an 
object of type Pipeline.
h2. Transformers

A Transformer is an algorithm that can transform one SparkDataFrame into 
another SparkDataFrame.

[jira] [Created] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-22 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-24359:
--

 Summary: SPIP: ML Pipelines in R
 Key: SPARK-24359
 URL: https://issues.apache.org/jira/browse/SPARK-24359
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 3.0.0
Reporter: Hossein Falaki
 Attachments: SparkML_ ML Pipelines in R.pdf

h1. Background and motivation

SparkR supports calling MLlib functionality with an [R-friendly 
API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
 Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
 has matured significantly. It allows users to build and maintain complicated 
machine learning pipelines. A lot of this functionality is difficult to expose 
using the simple formula-based API in SparkR.

 

We propose a new R package, _SparkML_, to be distributed along with SparkR as 
part of Apache Spark. This new package will be built on top of SparkR’s APIs to 
expose SparkML’s pipeline APIs and functionality.

 

*Why not SparkR?*

SparkR package contains ~300 functions. Many of these shadow functions in base 
and other popular CRAN packages. We think adding more functions to SparkR will 
degrade usability and make maintenance harder.

 

*Why not sparklyr?*

sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
they are not comprehensive. Also we propose a code-gen approach for this 
package to minimize work needed to expose future MLlib API, but sparklyr’s API 
is manually written.
h1. Target Personas
 * Existing SparkR users who need more flexible SparkML API
 * R users (data scientists, statisticians) who wish to build Spark ML 
pipelines in R

h1. Goals
 * R users can install SparkML from CRAN
 * R users will be able to import SparkML independent from SparkR
 * After setting up a Spark session R users can
 ** create a pipeline by chaining individual components and specifying their 
parameters
 ** tune a pipeline in parallel, taking advantage of Spark
 ** inspect a pipeline’s parameters and evaluation metrics
 ** repeatedly apply a pipeline

 * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
Transformers

h1. Non-Goals
 * Adding new algorithms to SparkML R package which do not exist in Scala
 * Parallelizing existing CRAN packages
 * Changing existing SparkR ML wrapping API

h1. Proposed API Changes
h2. Design goals

When encountering trade-offs in API, we will choose based on the following list 
of priorities. The API choice that addresses a higher priority goal will be 
chosen.
 # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
of future ML algorithms difficult will be ruled out.

 * *Semantic clarity*: We attempt to minimize confusion with other packages. 
Between conciseness and clarity, we will choose clarity.

 * *Maintainability and testability:* API choices that require manual 
maintenance or make testing difficult should be avoided.

 * *Interoperability with rest of Spark components:* We will keep the R API as 
thin as possible and keep all functionality implementation in JVM/Scala.

 * *Being natural to R users:* Ultimate users of this package are R users and 
they should find it easy and natural to use.

The API will follow familiar R function syntax, where the object is passed as 
the first argument of the method:  do_something(obj, arg1, arg2). All 
constructors are dot separated (e.g., spark.logistic.regression()) and all 
setters and getters are snake case (e.g., set_max_iter()). If a constructor 
gets arguments, they will be named arguments. For example:

 

> lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1)

When calls need to be chained, like the above example, the syntax translates 
naturally to a pipeline style with help from the very popular [magrittr 
package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
example:

> logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr
h2. Namespace

All new API will be under a new CRAN package, named SparkML. The package should 
be usable without needing SparkR in the namespace. The package will introduce a 
number of S4 classes that inherit from four basic classes. Here we will list 
the basic types with a few examples. An object of any child class can be 
instantiated with a function call that starts with spark.
h2. Pipeline & PipelineStage

A pipeline object contains one or more stages.  

 

> pipeline <- spark.pipeline() %>% set_stages(stage1, stage2, stage3)

 

Where stage1, stage2, etc 

[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-22 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-24359:
---
Attachment: SparkML_ ML Pipelines in R.pdf

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
>  
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs to 
> expose SparkML’s pipeline APIs and functionality.
>  
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
>  
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s API 
> is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> constructors are dot separated (e.g., spark.logistic.regression()) and all 
> setters and getters are snake case (e.g., set_max_iter()). If a constructor 
> gets arguments, they will be named arguments. For example:
> > lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1)
> When calls need to be chained, like the above example, the syntax translates 
> naturally to a pipeline style with help from the very popular [magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> > logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr
> h2. Namespace
> All new API will be under a new CRAN package, named SparkML. The package 
> should be usable without needing SparkR in the namespace. The package will 
> introduce a number of S4 

[jira] [Commented] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-22 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334885#comment-16334885
 ] 

Hossein Falaki commented on SPARK-23114:


[~felixcheung] I don't have any datasets to share, and every failure I have seen 
so far had a clear cause. I have been filing individual tickets for each failure 
mode, so I suggest we address them individually.

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17762) invokeJava fails when serialized argument list is larger than INT_MAX (2,147,483,647) bytes

2018-01-02 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309053#comment-16309053
 ] 

Hossein Falaki commented on SPARK-17762:


I think SPARK-17790 is one place where this limit causes problems. Anywhere else 
we call {{writeBin}} we face a similar limitation.
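
A minimal sketch of the serialize-in-multiple-parts workaround described in this 
ticket (the helper name and chunk size are illustrative, not existing SparkR code):
{code:r}
# Write a raw vector to a connection in slices, because a single writeBin()
# call cannot write more than INT_MAX bytes.
write_raw_chunked <- function(con, bytes, chunk_size = 2^30) {
  total <- length(bytes)
  offset <- 0
  while (offset < total) {
    n <- min(chunk_size, total - offset)
    writeBin(bytes[(offset + 1):(offset + n)], con)
    offset <- offset + n
  }
}
{code}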

> invokeJava fails when serialized argument list is larger than INT_MAX 
> (2,147,483,647) bytes
> ---
>
> Key: SPARK-17762
> URL: https://issues.apache.org/jira/browse/SPARK-17762
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> We call {{writeBin}} within {{writeRaw}} which is called from invokeJava on 
> the serialized arguments list. Unfortunately, {{writeBin}} has a hard-coded 
> limit set to {{R_LEN_T_MAX}} (which is itself set to {{INT_MAX}} in base). 
> To work around it, we can check for this case and serialize the batch in 
> multiple parts.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22766) Install R linter package in spark lib directory

2017-12-22 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301751#comment-16301751
 ] 

Hossein Falaki commented on SPARK-22766:


[~felixcheung] This just makes installation of whatever version of {{lintr}} we 
pick more stable. The two tickets are orthogonal.

> Install R linter package in spark lib directory
> ---
>
> Key: SPARK-22766
> URL: https://issues.apache.org/jira/browse/SPARK-22766
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Hossein Falaki
>
> The {{dev/lint-r.R}} file uses devtools to install the {{jimhester/lintr}} 
> package in the default site library location, which is 
> {{/usr/local/lib/R/site-library}}. This is not recommended and can fail 
> because we run this script as the jenkins user while that directory is owned 
> by root.
> We need to install the linter package in a local directory.
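
A minimal sketch of that fix, assuming a writable directory under the Spark 
checkout (the path is illustrative): prepend a local library to {{.libPaths()}} so 
devtools installs {{lintr}} there instead of the root-owned site library.
{code:r}
# Install lintr into a Spark-local library instead of /usr/local/lib/R/site-library.
local_lib <- file.path(Sys.getenv("SPARK_HOME", unset = "."), "R", "lib")
dir.create(local_lib, recursive = TRUE, showWarnings = FALSE)
.libPaths(c(local_lib, .libPaths()))  # install.packages() installs into the first entry

devtools::install_github("jimhester/lintr")
library(lintr, lib.loc = local_lib)
{code}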



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22812) Failing cran-check on master

2017-12-16 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293940#comment-16293940
 ] 

Hossein Falaki commented on SPARK-22812:


Do you know what is being checked in that step? Is it trying to reach a CRAN 
server?

> Failing cran-check on master 
> -
>
> Key: SPARK-22812
> URL: https://issues.apache.org/jira/browse/SPARK-22812
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hossein Falaki
>Priority: Minor
>
> When I run {{R/run-tests.sh}} or {{R/check-cran.sh}} I get the following 
> failure message:
> {code}
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) :
>   dims [product 22] do not match the length of object [0]
> {code}
> cc [~felixcheung] have you experienced this error before?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22812) Failing cran-check on master

2017-12-16 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-22812:
---
Priority: Minor  (was: Major)

> Failing cran-check on master 
> -
>
> Key: SPARK-22812
> URL: https://issues.apache.org/jira/browse/SPARK-22812
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hossein Falaki
>Priority: Minor
>
> When I run {{R/run-tests.sh}} or {{R/check-cran.sh}} I get the following 
> failure message:
> {code}
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) :
>   dims [product 22] do not match the length of object [0]
> {code}
> cc [~felixcheung] have you experienced this error before?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22812) Failing cran-check on master

2017-12-15 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-22812:
--

 Summary: Failing cran-check on master 
 Key: SPARK-22812
 URL: https://issues.apache.org/jira/browse/SPARK-22812
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.0
Reporter: Hossein Falaki


When I run {{R/run-tests.sh}} or {{R/check-cran.sh}} I get the following 
failure message:

{code}
* checking CRAN incoming feasibility ...Error in 
.check_package_CRAN_incoming(pkgdir) :
  dims [product 22] do not match the length of object [0]
{code}

cc [~felixcheung] have you experienced this error before?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22766) Install R linter package in spark lib directory

2017-12-12 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-22766:
--

 Summary: Install R linter package in spark lib directory
 Key: SPARK-22766
 URL: https://issues.apache.org/jira/browse/SPARK-22766
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.1
Reporter: Hossein Falaki


{{dev/lint-r.R}} uses devtools to install the {{jimhester/lintr}} package in 
the default site library location, which is {{/usr/local/lib/R/site-library}}. 
This is not recommended and can fail because we are running this script as 
jenkins while that directory is owned by root.

We need to install the linter package in a local directory.
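A minimal sketch of what the fix could look like (the path layout and the 
{{SPARK_HOME}} fallback are assumptions, not the final script):

{code}
# Install lintr into a writable library under the Spark checkout instead of
# the root-owned site library.
local_lib <- file.path(Sys.getenv("SPARK_HOME", "."), "R", "lib")
dir.create(local_lib, recursive = TRUE, showWarnings = FALSE)

# Putting the local library first makes devtools install into it by default.
.libPaths(c(local_lib, .libPaths()))
devtools::install_github("jimhester/lintr")
{code}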



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22344) Prevent R CMD check from using /tmp

2017-10-26 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221460#comment-16221460
 ] 

Hossein Falaki commented on SPARK-22344:


I don't have a solid pointer as to why we are creating these temp directories 
for Hive. I think it would be nicer to fix this in Spark itself. It is good 
practice not to leave files in /tmp.

> Prevent R CMD check from using /tmp
> ---
>
> Key: SPARK-22344
> URL: https://issues.apache.org/jira/browse/SPARK-22344
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0
>Reporter: Shivaram Venkataraman
>
> When R CMD check is run on the SparkR package it leaves behind files in /tmp 
> which is a violation of CRAN policy. We should instead write to Rtmpdir. 
> Notes from CRAN are below
> {code}
> Checking this leaves behind dirs
>hive/$USER
>$USER
> and files named like
>b4f6459b-0624-4100-8358-7aa7afbda757_resources
> in /tmp, in violation of the CRAN Policy.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-10-24 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217279#comment-16217279
 ] 

Hossein Falaki commented on SPARK-15799:


Is there a ticket to follow up on the new policy violation issue? The package 
has been archived by CRAN.

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>Assignee: Shivaram Venkataraman
> Fix For: 2.1.2
>
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17902) collect() ignores stringsAsFactors

2017-10-18 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209958#comment-16209958
 ] 

Hossein Falaki commented on SPARK-17902:


A simple unit test we could add would be:

{code}
> df <- createDataFrame(iris)
> sapply(iris, typeof) == sapply(collect(df, stringsAsFactors = T), typeof)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width  Species
TRUE TRUE TRUE TRUEFALSE
{code}

As for the solution, I suggest performing the conversion inside [this 
loop|https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L1168].
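A hypothetical illustration of that conversion (the real change would live 
inside the column loop in {{DataFrame.R}}; the helper name here is made up):

{code}
# Convert character columns of the collected local data.frame to factors when
# the caller asked for stringsAsFactors = TRUE.
applyStringsAsFactors <- function(localDF, stringsAsFactors) {
  if (stringsAsFactors) {
    charCols <- vapply(localDF, is.character, logical(1))
    localDF[charCols] <- lapply(localDF[charCols], as.factor)
  }
  localDF
}
{code}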

> collect() ignores stringsAsFactors
> --
>
> Key: SPARK-17902
> URL: https://issues.apache.org/jira/browse/SPARK-17902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> `collect()` function signature includes an optional flag named 
> `stringsAsFactors`. It seems it is completely ignored.
> {code}
> str(collect(createDataFrame(iris), stringsAsFactors = TRUE)))
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-10-12 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202385#comment-16202385
 ] 

Hossein Falaki commented on SPARK-15799:


Congrats everyone. Thanks for the hard work on this.

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>Assignee: Shivaram Venkataraman
> Fix For: 2.1.2
>
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-09-25 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179754#comment-16179754
 ] 

Hossein Falaki commented on SPARK-15799:


It seems we can trivially use {{with}} instead of {{attach}}. [~shivaram] do 
you object? Otherwise I can file a ticket and submit a patch.
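For illustration, the difference in plain R (nothing SparkR-specific here):

{code}
df <- data.frame(x = 1:3, y = 4:6)

# attach() modifies the global search path, which is what the CRAN check flags.
attach(df)
sum(x + y)
detach(df)

# with() evaluates the expression in the data.frame's environment instead,
# leaving the search path untouched.
with(df, sum(x + y))
{code}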

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-09-25 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179716#comment-16179716
 ] 

Hossein Falaki commented on SPARK-15799:


Hi guys, are there any updates on this? Is there anything I can help with?

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21940) Support timezone for timestamps in SparkR

2017-09-06 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-21940:
--

 Summary: Support timezone for timestamps in SparkR
 Key: SPARK-21940
 URL: https://issues.apache.org/jira/browse/SPARK-21940
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Hossein Falaki


{{SparkR::createDataFrame()}} wipes timezone attribute from POSIXct and 
POSIXlt. See following example:

{code}
> x <- data.frame(x = c(Sys.time()))
> x
x
1 2017-09-06 19:17:16
> attr(x$x, "tzone") <- "Europe/Paris"
> x
x
1 2017-09-07 04:17:16
> collect(createDataFrame(x))
x
1 2017-09-06 19:17:16
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-08-22 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16137639#comment-16137639
 ] 

Hossein Falaki commented on SPARK-15799:


Hi [~felixcheung], would you please share more details about what the feedback 
was? Do we have follow-up tickets for the work (to address their comments)?

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21450) List of NA is flattened inside a SparkR struct type

2017-07-18 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki resolved SPARK-21450.

Resolution: Not A Bug

> List of NA is flattened inside a SparkR struct type
> ---
>
> Key: SPARK-21450
> URL: https://issues.apache.org/jira/browse/SPARK-21450
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> Consider the following two cases copied from {{test_sparkSQL.R}}:
> {code}
> df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
> schema <- structType(structField("date", "date"))
> s1 <- collect(select(df, from_json(df$col, schema)))
> s2 <- collect(select(df, from_json(df$col, schema, dateFormat = 
> "dd/MM/")))
> {code}
> If you inspect s1 using {{str(s1)}} you will find:
> {code}
> 'data.frame': 2 obs. of  1 variable:
>  $ jsontostructs(col):List of 2
>   ..$ : logi NA
> {code}
> But for s2, running {{str(s2)}} results in:
> {code}
> 'data.frame': 2 obs. of  1 variable:
>  $ jsontostructs(col):List of 2
>   ..$ :List of 1
>   .. ..$ date: Date, format: "2014-10-21"
>   .. ..- attr(*, "class")= chr "struct"
> {code}
> I assume this is not intentional and is just a subtle bug. Do you think 
> otherwise? [~shivaram] and [~felixcheung]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21450) List of NA is flattened inside a SparkR struct type

2017-07-18 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16091194#comment-16091194
 ] 

Hossein Falaki commented on SPARK-21450:


Thanks guys. I will close it as not an issue.

> List of NA is flattened inside a SparkR struct type
> ---
>
> Key: SPARK-21450
> URL: https://issues.apache.org/jira/browse/SPARK-21450
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> Consider the following two cases copied from {{test_sparkSQL.R}}:
> {code}
> df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
> schema <- structType(structField("date", "date"))
> s1 <- collect(select(df, from_json(df$col, schema)))
> s2 <- collect(select(df, from_json(df$col, schema, dateFormat = 
> "dd/MM/")))
> {code}
> If you inspect s1 using {{str(s1)}} you will find:
> {code}
> 'data.frame': 2 obs. of  1 variable:
>  $ jsontostructs(col):List of 2
>   ..$ : logi NA
> {code}
> But for s2, running {{str(s2)}} results in:
> {code}
> 'data.frame': 2 obs. of  1 variable:
>  $ jsontostructs(col):List of 2
>   ..$ :List of 1
>   .. ..$ date: Date, format: "2014-10-21"
>   .. ..- attr(*, "class")= chr "struct"
> {code}
> I assume this is not intentional and is just a subtle bug. Do you think 
> otherwise? [~shivaram] and [~felixcheung]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21450) List of NA is flattened inside a SparkR struct type

2017-07-17 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16091149#comment-16091149
 ] 

Hossein Falaki commented on SPARK-21450:


Yes, it was a copy/paste typo. Here is an example: 
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346827/1860822503083725/936056/latest.html

The error is caused by our deserialization. I just want to confirm if this is 
indeed a bug.

> List of NA is flattened inside a SparkR struct type
> ---
>
> Key: SPARK-21450
> URL: https://issues.apache.org/jira/browse/SPARK-21450
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> Consider the following two cases copied from {{test_sparkSQL.R}}:
> {code}
> df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
> schema <- structType(structField("date", "date"))
> s1 <- collect(select(df, from_json(df$col, schema)))
> s2 <- collect(select(df, from_json(df$col, schema, dateFormat = 
> "dd/MM/")))
> {code}
> If you inspect s1 using {{str(s1)}} you will find:
> {code}
> 'data.frame': 2 obs. of  1 variable:
>  $ jsontostructs(col):List of 2
>   ..$ : logi NA
> {code}
> But for s2, running {{str(s2)}} results in:
> {code}
> 'data.frame': 2 obs. of  1 variable:
>  $ jsontostructs(col):List of 2
>   ..$ :List of 1
>   .. ..$ date: Date, format: "2014-10-21"
>   .. ..- attr(*, "class")= chr "struct"
> {code}
> I assume this is not intentional and is just a subtle bug. Do you think 
> otherwise? [~shivaram] and [~felixcheung]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21450) List of NA is flattened inside a SparkR struct type

2017-07-17 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-21450:
---
Description: 
Consider the following two cases copied from {{test_sparkSQL.R}}:

{code}
df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
schema <- structType(structField("date", "date"))
s1 <- collect(select(df, from_json(df$col, schema)))
s2 <- collect(select(df, from_json(df$col, schema, dateFormat = "dd/MM/")))
{code}

If you inspect s1 using {{str(s1)}} you will find:
{code}
'data.frame':   2 obs. of  1 variable:
 $ jsontostructs(col):List of 2
  ..$ : logi NA
{code}

But for s2, running {{str(s2)}} results in:
{code}
'data.frame':   2 obs. of  1 variable:
 $ jsontostructs(col):List of 2
  ..$ :List of 1
  .. ..$ date: Date, format: "2014-10-21"
  .. ..- attr(*, "class")= chr "struct"
{code}

I assume this is not intentional and is just a subtle bug. Do you think 
otherwise? [~shivaram] and [~felixcheung]


  was:
Consider the following two cases copied from {{test_sparkSQL.R}}:

{code}
df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
schema <- structType(structField("date", "date"))
s1 <- collect(select(df, from_json(df$col, schema)))
s2 <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/")))
{code}

If you inspect s1 using {{str(s1)}} you will find:
{code}
'data.frame':   2 obs. of  1 variable:
 $ jsontostructs(col):List of 2
  ..$ : logi NA
{code}

But for s2, running {{str(s2)}} results in:
{code}
'data.frame':   2 obs. of  1 variable:
 $ jsontostructs(col):List of 2
  ..$ :List of 1
  .. ..$ date: Date, format: "2014-10-21"
  .. ..- attr(*, "class")= chr "struct"
{code}

I assume this is not intentional and is just a subtle bug. Do you think 
otherwise? [~shivaram] and [~felixcheung]



> List of NA is flattened inside a SparkR struct type
> ---
>
> Key: SPARK-21450
> URL: https://issues.apache.org/jira/browse/SPARK-21450
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> Consider the following two cases copied from {{test_sparkSQL.R}}:
> {code}
> df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
> schema <- structType(structField("date", "date"))
> s1 <- collect(select(df, from_json(df$col, schema)))
> s2 <- collect(select(df, from_json(df$col, schema, dateFormat = 
> "dd/MM/")))
> {code}
> If you inspect s1 using {{str(s1)}} you will find:
> {code}
> 'data.frame': 2 obs. of  1 variable:
>  $ jsontostructs(col):List of 2
>   ..$ : logi NA
> {code}
> But for s2, running {{str(s2)}} results in:
> {code}
> 'data.frame': 2 obs. of  1 variable:
>  $ jsontostructs(col):List of 2
>   ..$ :List of 1
>   .. ..$ date: Date, format: "2014-10-21"
>   .. ..- attr(*, "class")= chr "struct"
> {code}
> I assume this is not intentional and is just a subtle bug. Do you think 
> otherwise? [~shivaram] and [~felixcheung]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21450) List of NA is flattened inside a SparkR struct type

2017-07-17 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-21450:
---
Description: 
Consider the following two cases copied from {{test_sparkSQL.R}}:

{code}
df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
schema <- structType(structField("date", "date"))
s1 <- collect(select(df, from_json(df$col, schema)))
s2 <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/")))
{code}

If you inspect s1 using {{str(s1)}} you will find:
{code}
'data.frame':   2 obs. of  1 variable:
 $ jsontostructs(col):List of 2
  ..$ : logi NA
{code}

But for s2, running {{str(s2)}} results in:
{code}
'data.frame':   2 obs. of  1 variable:
 $ jsontostructs(col):List of 2
  ..$ :List of 1
  .. ..$ date: Date, format: "2014-10-21"
  .. ..- attr(*, "class")= chr "struct"
{code}

I assume this is not intentional and is just a subtle bug. Do you think 
otherwise? [~shivaram] and @felixcheung


  was:
Consider the following two cases copied from {{test_sparkSQL.R}}:

{code}
df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
schema <- structType(structField("date", "date"))
s1 <- collect(select(df, from_json(df$col, schema)))
s2 <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/")))
{code}

If you inspect s1 using {{str(s1)}} you will find:
{code}
'data.frame':   2 obs. of  1 variable:
 $ jsontostructs(col):List of 2
  ..$ : logi NA
{code}

But for s2, running {{str(s2)}} results in:
{code}
'data.frame':   2 obs. of  1 variable:
 $ jsontostructs(col):List of 2
  ..$ :List of 1
  .. ..$ date: Date, format: "2014-10-21"
  .. ..- attr(*, "class")= chr "struct"
{code}

I assume this is not intentional and is just a subtle bug. Do you think 
otherwise? [~shivaram] and @ felixcheung



> List of NA is flattened inside a SparkR struct type
> ---
>
> Key: SPARK-21450
> URL: https://issues.apache.org/jira/browse/SPARK-21450
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> Consider the following two cases copied from {{test_sparkSQL.R}}:
> {code}
> df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
> schema <- structType(structField("date", "date"))
> s1 <- collect(select(df, from_json(df$col, schema)))
> s2 <- collect(select(df, from_json(df$col, schema2, dateFormat = 
> "dd/MM/")))
> {code}
> If you inspect s1 using {{str(s1)}} you will find:
> {code}
> 'data.frame': 2 obs. of  1 variable:
>  $ jsontostructs(col):List of 2
>   ..$ : logi NA
> {code}
> But for s2, running {{str(s2)}} results in:
> {code}
> 'data.frame': 2 obs. of  1 variable:
>  $ jsontostructs(col):List of 2
>   ..$ :List of 1
>   .. ..$ date: Date, format: "2014-10-21"
>   .. ..- attr(*, "class")= chr "struct"
> {code}
> I assume this is not intentional and is just a subtle bug. Do you think 
> otherwise? [~shivaram] and @felixcheung



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21450) List of NA is flattened inside a SparkR struct type

2017-07-17 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-21450:
---
Description: 
Consider the following two cases copied from {{test_sparkSQL.R}}:

{code}
df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
schema <- structType(structField("date", "date"))
s1 <- collect(select(df, from_json(df$col, schema)))
s2 <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/")))
{code}

If you inspect s1 using {{str(s1)}} you will find:
{code}
'data.frame':   2 obs. of  1 variable:
 $ jsontostructs(col):List of 2
  ..$ : logi NA
{code}

But for s2, running {{str(s2)}} results in:
{code}
'data.frame':   2 obs. of  1 variable:
 $ jsontostructs(col):List of 2
  ..$ :List of 1
  .. ..$ date: Date, format: "2014-10-21"
  .. ..- attr(*, "class")= chr "struct"
{code}

I assume this is not intentional and is just a subtle bug. Do you think 
otherwise? [~shivaram] and [~felixcheung]


  was:
Consider the following two cases copied from {{test_sparkSQL.R}}:

{code}
df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
schema <- structType(structField("date", "date"))
s1 <- collect(select(df, from_json(df$col, schema)))
s2 <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/")))
{code}

If you inspect s1 using {{str(s1)}} you will find:
{code}
'data.frame':   2 obs. of  1 variable:
 $ jsontostructs(col):List of 2
  ..$ : logi NA
{code}

But for s2, running {{str(s2)}} results in:
{code}
'data.frame':   2 obs. of  1 variable:
 $ jsontostructs(col):List of 2
  ..$ :List of 1
  .. ..$ date: Date, format: "2014-10-21"
  .. ..- attr(*, "class")= chr "struct"
{code}

I assume this is not intentional and is just a subtle bug. Do you think 
otherwise? [~shivaram] and @felixcheung



> List of NA is flattened inside a SparkR struct type
> ---
>
> Key: SPARK-21450
> URL: https://issues.apache.org/jira/browse/SPARK-21450
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> Consider the following two cases copied from {{test_sparkSQL.R}}:
> {code}
> df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
> schema <- structType(structField("date", "date"))
> s1 <- collect(select(df, from_json(df$col, schema)))
> s2 <- collect(select(df, from_json(df$col, schema2, dateFormat = 
> "dd/MM/")))
> {code}
> If you inspect s1 using {{str(s1)}} you will find:
> {code}
> 'data.frame': 2 obs. of  1 variable:
>  $ jsontostructs(col):List of 2
>   ..$ : logi NA
> {code}
> But for s2, running {{str(s2)}} results in:
> {code}
> 'data.frame': 2 obs. of  1 variable:
>  $ jsontostructs(col):List of 2
>   ..$ :List of 1
>   .. ..$ date: Date, format: "2014-10-21"
>   .. ..- attr(*, "class")= chr "struct"
> {code}
> I assume this is not intentional and is just a subtle bug. Do you think 
> otherwise? [~shivaram] and [~felixcheung]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21450) List of NA is flattened inside a SparkR struct type

2017-07-17 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-21450:
--

 Summary: List of NA is flattened inside a SparkR struct type
 Key: SPARK-21450
 URL: https://issues.apache.org/jira/browse/SPARK-21450
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Hossein Falaki


Consider the following two cases copied from {{test_sparkSQL.R}}:

{code}
df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
schema <- structType(structField("date", "date"))
s1 <- collect(select(df, from_json(df$col, schema)))
s2 <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/")))
{code}

If you inspect s1 using {{str(s1)}} you will find:
{code}
'data.frame':   2 obs. of  1 variable:
 $ jsontostructs(col):List of 2
  ..$ : logi NA
{code}

But for s2, running {{str(s2)}} results in:
{code}
'data.frame':   2 obs. of  1 variable:
 $ jsontostructs(col):List of 2
  ..$ :List of 1
  .. ..$ date: Date, format: "2014-10-21"
  .. ..- attr(*, "class")= chr "struct"
{code}

I assume this is not intentional and is just a subtle bug. Do you think 
otherwise? [~shivaram] and @ felixcheung




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21263) NumberFormatException is not thrown while converting an invalid string to float/double

2017-07-04 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074109#comment-16074109
 ] 

Hossein Falaki commented on SPARK-21263:


[~sowen] note that the user specified the PERMISSIVE mode. In this mode the CSV 
data source will try to ignore errors and return some result. If the mode is 
FAILFAST, it should throw an exception. I see the permissiveness of the 
different modes as follows:

{code}
PERMISSIVE > DROPMALFORMED > FAILFAST
{code}

Here we have different behavior for {{IntegerType}} vs. {{DoubleType}}. That 
needs to be fixed, and the behavior should be consistent.
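For illustration, the same three modes expressed in SparkR (the file path is 
hypothetical and a running SparkSession is assumed). With the malformed value 
{{10u000}} in a numeric column, PERMISSIVE is expected to still return some 
result, DROPMALFORMED to drop the malformed row, and FAILFAST to throw:

{code}
permissive <- read.df("/tmp/patients.csv", "csv", header = "true", mode = "PERMISSIVE")
dropped    <- read.df("/tmp/patients.csv", "csv", header = "true", mode = "DROPMALFORMED")
failfast   <- read.df("/tmp/patients.csv", "csv", header = "true", mode = "FAILFAST")
{code}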

> NumberFormatException is not thrown while converting an invalid string to 
> float/double
> --
>
> Key: SPARK-21263
> URL: https://issues.apache.org/jira/browse/SPARK-21263
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.1
>Reporter: Navya Krishnappa
>
> When reading a below-mentioned data by specifying user-defined schema, 
> exception is not thrown. Refer the details :
> *Data:* 
> 'PatientID','PatientName','TotalBill'
> '1000','Patient1','10u000'
> '1001','Patient2','3'
> '1002','Patient3','4'
> '1003','Patient4','5'
> '1004','Patient5','6'
> *Source code*: 
> Dataset dataset = sparkSession.read().schema(schema)
> .option(INFER_SCHEMA, "true")
> .option(DELIMITER, ",")
> .option(QUOTE, "\"")
> .option(MODE, Mode.PERMISSIVE)
> .csv(sourceFile);
> When we collect the dataset data: 
> dataset.collectAsList();
> *Schema1*: 
> [StructField(PatientID,IntegerType,true), 
> StructField(PatientName,StringType,true), 
> StructField(TotalBill,IntegerType,true)]
> *Result *: Throws NumerFormatException 
> Caused by: java.lang.NumberFormatException: For input string: "10u000"
> *Schema2*: 
> [StructField(PatientID,IntegerType,true), 
> StructField(PatientName,StringType,true), 
> StructField(TotalBill,DoubleType,true)]
> *Actual Result*: 
> "PatientID": 1000,
> "NumberOfVisits": "400",
> "TotalBill": 10,
> *Expected Result*: Should throw NumberFormatException for input string 
> "10u000"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20684) expose createGlobalTempView and dropGlobalTempView in SparkR

2017-05-10 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-20684:
---
Summary: expose createGlobalTempView and dropGlobalTempView in SparkR  
(was: expose createGlobalTempView in SparkR)

> expose createGlobalTempView and dropGlobalTempView in SparkR
> 
>
> Key: SPARK-20684
> URL: https://issues.apache.org/jira/browse/SPARK-20684
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> This is a useful API that is not exposed in SparkR. It will help with moving 
> data between languages within a single Spark application.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20684) expose createGlobalTempView in SparkR

2017-05-10 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005552#comment-16005552
 ] 

Hossein Falaki commented on SPARK-20684:


Yes I agree.

> expose createGlobalTempView in SparkR
> -
>
> Key: SPARK-20684
> URL: https://issues.apache.org/jira/browse/SPARK-20684
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> This is a useful API that is not exposed in SparkR. It will help with moving 
> data between languages within a single Spark application.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20684) expose createGlobalTempView in SparkR

2017-05-09 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-20684:
--

 Summary: expose createGlobalTempView in SparkR
 Key: SPARK-20684
 URL: https://issues.apache.org/jira/browse/SPARK-20684
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Hossein Falaki


This is a useful API that is not exposed in SparkR. It will help with moving 
data between languages within a single Spark application.
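A sketch of how the exposed API might be used (the R-side function name and 
signature are assumptions; today only the Scala/Java {{createGlobalTempView}} 
exists):

{code}
df <- createDataFrame(faithful)

# Proposed SparkR call: registers the view in the cross-session global_temp
# database so other language frontends of the same application can read it.
createGlobalTempView(df, "faithful_view")

head(sql("SELECT count(*) FROM global_temp.faithful_view"))
{code}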



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20661) SparkR tableNames() test fails

2017-05-08 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-20661:
--

 Summary: SparkR tableNames() test fails
 Key: SPARK-20661
 URL: https://issues.apache.org/jira/browse/SPARK-20661
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Hossein Falaki


Due to prior state created by other test cases, testing {{tableNames()}} is 
failing in master.

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20088) Do not create new SparkContext in SparkR createSparkContext

2017-03-24 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-20088:
--

 Summary: Do not create new SparkContext in SparkR 
createSparkContext
 Key: SPARK-20088
 URL: https://issues.apache.org/jira/browse/SPARK-20088
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Hossein Falaki


In the implementation of {{createSparkContext}}, we are calling 
{code}
 new JavaSparkContext()
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20007) Make SparkR apply() functions robust to workers that return empty data.frame

2017-03-17 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-20007:
--

 Summary: Make SparkR apply() functions robust to workers that 
return empty data.frame
 Key: SPARK-20007
 URL: https://issues.apache.org/jira/browse/SPARK-20007
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Hossein Falaki


When using {{gapply()}} (or other members of the {{apply()}} family) with a 
schema, Spark will try to parse the data returned from the R process on each 
worker as Spark DataFrame Rows based on that schema. In this case our provided 
schema says that we have six columns. When an R worker returns results to the 
JVM, SparkSQL will try to access its columns one by one and cast them to the 
proper types. If the R worker returns nothing, the JVM will throw an 
{{ArrayIndexOutOfBoundsException}}.
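A hypothetical repro sketch (the column names and the two-column schema are 
made up; with the current code this is expected to hit the 
{{ArrayIndexOutOfBoundsException}} described above whenever a group produces an 
empty data.frame):

{code}
df <- createDataFrame(data.frame(key = c(1, 1, 2), value = c(10, 20, 30)))
schema <- structType(structField("key", "double"),
                     structField("total", "double"))

result <- gapply(df, "key", function(key, x) {
  if (key[[1]] == 2) {
    data.frame()                                    # worker returns nothing
  } else {
    data.frame(key = key[[1]], total = sum(x$value))
  }
}, schema)

collect(result)
{code}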



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR

2016-12-19 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762687#comment-15762687
 ] 

Hossein Falaki commented on SPARK-18924:


It would be good to think about this along with the efforts to have zero-copy 
data sharing between the JVM and R. I think if we do that, a lot of the Ser/De 
problems in the data plane will go away.

> Improve collect/createDataFrame performance in SparkR
> -
>
> Key: SPARK-18924
> URL: https://issues.apache.org/jira/browse/SPARK-18924
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Xiangrui Meng
>Priority: Critical
>
> SparkR has its own SerDe for data serialization between JVM and R.
> The SerDe on the JVM side is implemented in:
> * 
> [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
> * 
> [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]
> The SerDe on the R side is implemented in:
> * 
> [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
> * 
> [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]
> The serialization between JVM and R suffers from huge storage and computation 
> overhead. For example, a short round trip of 1 million doubles surprisingly 
> took 3 minutes on my laptop:
> {code}
> > system.time(collect(createDataFrame(data.frame(x=runif(100)
>user  system elapsed
>  14.224   0.582 189.135
> {code}
> Collecting a medium-sized DataFrame to local and continuing with a local R 
> workflow is a use case we should pay attention to. SparkR will never be able 
> to cover all existing features from CRAN packages. It is also unnecessary for 
> Spark to do so because not all features need scalability. 
> Several factors contribute to the serialization overhead:
> 1. The SerDe in R side is implemented using high-level R methods.
> 2. DataFrame columns are not efficiently serialized, primitive type columns 
> in particular.
> 3. Some overhead in the serialization protocol/impl.
> 1) might be discussed before because R packages like rJava exist before 
> SparkR. I'm not sure whether we have a license issue in depending on those 
> libraries. Another option is to switch to low-level R'C interface or Rcpp, 
> which again might have license issue. I'm not an expert here. If we have to 
> implement our own, there still exist much space for improvement, discussed 
> below.
> 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, 
> which collects rows to local and then constructs columns. However,
> * it ignores column types and results boxing/unboxing overhead
> * it collects all objects to driver and results high GC pressure
> A relatively simple change is to implement specialized column builder based 
> on column types, primitive types in particular. We need to handle null/NA 
> values properly. A simple data structure we can use is
> {code}
> val size: Int
> val nullIndexes: Array[Int]
> val notNullValues: Array[T] // specialized for primitive types
> {code}
> On the R side, we can use `readBin` and `writeBin` to read the entire vector 
> in a single method call. The speed seems reasonable (at the order of GB/s):
> {code}
> > x <- runif(1000) # 1e7, not 1e6
> > system.time(r <- writeBin(x, raw(0)))
>user  system elapsed
>   0.036   0.021   0.059
> > > system.time(y <- readBin(r, double(), 1000))
>user  system elapsed
>   0.015   0.007   0.024
> {code}
> This is just a proposal that needs to be discussed and formalized. But in 
> general, it should be feasible to obtain 20x or more performance gain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18011) SparkR serialize "NA" throws exception

2016-10-19 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-18011:
---
Description: 
For some versions of R, if a Date column has an "NA" field, the backend will 
throw a {{NegativeArraySizeException}}.
To reproduce the problem:

{code}
> a <- as.Date(c("2016-11-11", "NA"))
> b <- as.data.frame(a)
> c <- createDataFrame(b)
> dim(c)

16/10/19 10:31:24 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NegativeArraySizeException
at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110)
at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119)
at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128)
at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:77)
at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61)
at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161)
at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.api.r.SQLUtils$.bytesToRow(SQLUtils.scala:160)
at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

  was:
For some versions of R, if Date has "NA" field, backend will throw negative 
index exception.
To reproduce the problem:

> a <- as.Date(c("2016-11-11", "NA"))
> b <- as.data.frame(a)
> c <- createDataFrame(b)
> dim(c)

16/10/19 10:31:24 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NegativeArraySizeException
at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110)
at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119)
at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128)
at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:77)
at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61)
at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161)
at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.api.r.SQLUtils$.bytesToRow(SQLUtils.scala:160)
at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 

[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-15 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578607#comment-15578607
 ] 

Hossein Falaki commented on SPARK-17878:


I think moving it to another ticket is a good idea. One concern is that the API 
would be different between Scala and DDL.

> Support for multiple null values when reading CSV data
> --
>
> Key: SPARK-17878
> URL: https://issues.apache.org/jira/browse/SPARK-17878
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> There are CSV files out there with multiple values that are supposed to be 
> interpreted as null. As a result, multiple spark users have asked for this 
> feature built into the CSV data source. It can be easily implemented in a 
> backwards compatible way:
> - Currently CSV data source supports an option named {{nullValue}}.
> - We can add logic in {{CSVOptions}} to understand option names that match 
> {{nullValue[\d]}}. This way the user can specify one or multiple null 
> values.
> {code}
> val df = spark.read.format("CSV").option("nullValue1", 
> "-").option("nullValue2", "*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-10-13 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573692#comment-15573692
 ] 

Hossein Falaki commented on SPARK-17916:


Thanks for linking it. Yes, they are very much the same issue. However, I 
slightly disagree with the proposed solution. I will comment on the PR.

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17919) Make timeout to RBackend configurable in SparkR

2016-10-13 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17919:
--

 Summary: Make timeout to RBackend configurable in SparkR
 Key: SPARK-17919
 URL: https://issues.apache.org/jira/browse/SPARK-17919
 Project: Spark
  Issue Type: Story
  Components: SparkR
Affects Versions: 2.0.1
Reporter: Hossein Falaki


I am working on a project where {{gapply()}} is being used with a large dataset 
that happens to be extremely skewed. On the skewed partition, the user function 
takes more than 2 hours to return, which turns out to be longer than the 
timeout that we hardcode in SparkR for the backend connection.

{code}
connectBackend <- function(hostname, port, timeout = 6000) 
{code}

Ideally the user should be able to reconfigure Spark and increase the timeout. 
It should be a small fix.
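A minimal sketch of the kind of change this asks for (the environment variable 
name and the helper are assumptions; the eventual fix may use a Spark conf 
instead):

{code}
backendConnectionTimeout <- function() {
  as.integer(Sys.getenv("SPARKR_BACKEND_CONNECTION_TIMEOUT", "6000"))
}

connectBackend <- function(hostname, port, timeout = backendConnectionTimeout()) {
  # Same connection as today, but with a user-tunable timeout.
  socketConnection(host = hostname, port = port, server = FALSE,
                   blocking = TRUE, open = "wb", timeout = timeout)
}
{code}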



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-10-13 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17916:
--

 Summary: CSV data source treats empty string as null no matter 
what nullValue option is
 Key: SPARK-17916
 URL: https://issues.apache.org/jira/browse/SPARK-17916
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Hossein Falaki


When the user configures {{nullValue}} in the CSV data source, in addition to 
those values, all empty string values are also converted to null.

{code}
data:
col1,col2
1,"-"
2,""
{code}

{code}
spark.read.format("csv").option("nullValue", "-")
{code}

We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17902) collect() ignores stringsAsFactors

2016-10-13 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572689#comment-15572689
 ] 

Hossein Falaki commented on SPARK-17902:


Thanks for the pointer, [~shivaram]. I will submit a patch with a regression 
test. The only obvious side effect of this bug is that the collected type will 
be String, while it should have been a Factor. What makes it bad is that it is 
in our documentation and it used to work, so it is a regression.

> collect() ignores stringsAsFactors
> --
>
> Key: SPARK-17902
> URL: https://issues.apache.org/jira/browse/SPARK-17902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> `collect()` function signature includes an optional flag named 
> `stringsAsFactors`. It seems it is completely ignored.
> {code}
> str(collect(createDataFrame(iris), stringsAsFactors = TRUE)))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17902) collect() ignores stringsAsFactors

2016-10-12 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17902:
--

 Summary: collect() ignores stringsAsFactors
 Key: SPARK-17902
 URL: https://issues.apache.org/jira/browse/SPARK-17902
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.1
Reporter: Hossein Falaki


`collect()` function signature includes an optional flag named 
`stringsAsFactors`. It seems it is completely ignored.

{code}
str(collect(createDataFrame(iris), stringsAsFactors = TRUE)))
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-11 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567180#comment-15567180
 ] 

Hossein Falaki commented on SPARK-17878:


Sure. If passing a list is possible, it is the better choice. I just don't want 
to block this feature on an API change in SparkSQL.

> Support for multiple null values when reading CSV data
> --
>
> Key: SPARK-17878
> URL: https://issues.apache.org/jira/browse/SPARK-17878
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> There are CSV files out there with multiple values that are supposed to be 
> interpreted as null. As a result, multiple spark users have asked for this 
> feature built into the CSV data source. It can be easily implemented in a 
> backwards compatible way:
> - Currently CSV data source supports an option named {{nullValue}}.
> - We can add logic in {{CSVOptions}} to understand option names that match 
> {{nullValue[\d]}}. This way the user can specify one or multiple null 
> values.
> {code}
> val df = spark.read.format("CSV").option("nullValue1", 
> "-").option("nullValue2", "*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-11 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567104#comment-15567104
 ] 

Hossein Falaki commented on SPARK-17878:


That would require an API change in SparkSQL. Otherwise, we would need to split 
the string passed to {{nullValue}} on a delimiter. Neither of these is a safe 
choice. I think accepting {{nullValue1}}, {{nullValue2}}, etc. (along with the 
existing {{nullValue}}) is:
* backwards compatible
* clear
* extensible for other options in the future, e.g. quoteCharacter

> Support for multiple null values when reading CSV data
> --
>
> Key: SPARK-17878
> URL: https://issues.apache.org/jira/browse/SPARK-17878
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> There are CSV files out there with multiple values that are supposed to be 
> interpreted as null. As a result, multiple spark users have asked for this 
> feature built into the CSV data source. It can be easily implemented in a 
> backwards compatible way:
> - Currently CSV data source supports an option named {{nullValue}}.
> - We can add logic in {{CSVOptions}} to understand option names that match 
> {{nullValue[\d]}}. This way the user can specify one or multiple null 
> values.
> {code}
> val df = spark.read.format("CSV").option("nullValue1", 
> "-").option("nullValue2", "*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-11 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17878:
--

 Summary: Support for multiple null values when reading CSV data
 Key: SPARK-17878
 URL: https://issues.apache.org/jira/browse/SPARK-17878
 Project: Spark
  Issue Type: Story
  Components: SQL
Affects Versions: 2.0.1
Reporter: Hossein Falaki


There are CSV files out there with multiple values that are supposed to be 
interpreted as null. As a result, multiple Spark users have asked for this 
feature built into the CSV data source. It can be easily implemented in a 
backwards compatible way:

- Currently CSV data source supports an option named {{nullValue}}.
- We can add logic in {{CSVOptions}} to understand option names that match 
{{nullValue[\d]}}. This way the user can specify one or more null values.

{code}
val df = spark.read.format("CSV").option("nullValue1", 
"-").option("nullValue2", "*")
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-11 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566541#comment-15566541
 ] 

Hossein Falaki commented on SPARK-17781:


[~shivaram] Thanks for looking into it. I think the problem applies to 
{{dapply}} as well. For example this fails:
{code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> collect(dapply(df, function(x) {data.frame(res = x$date)}, schema = 
> structType(structField("res", "date"
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 52.0 failed 4 times, most recent failure: Lost task 0.3 in stage 52.0 
(TID 10114, 10.0.229.211): java.lang.RuntimeException: java.lang.Double is not 
a valid external type for schema of date
{code}

I spent a few hours getting to the root of it. We have the correct type all the 
way until {{readList}} in {{deserialize.R}}. I instrumented that function: we 
get the correct type from {{readObject()}}, but once the value is placed in the 
list it loses its type.
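
For illustration, here is a minimal base-R sketch of the kind of coercion that 
can hide the class once values flow through lists (an illustration of the 
general pitfall, not the actual {{deserialize.R}} code):

{code}
d <- Sys.Date()
class(d)           # "Date"
unclass(d)         # the underlying double, days since 1970-01-01

l <- list()
l[[1]] <- d
class(l[[1]])      # still "Date" while it sits in the list

unlist(l)          # numeric: unlist() drops the class attribute
paste("value", l)  # "value 17084"-style output; only the raw double shows up
{code}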

> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for dapply family functions, inside 
> the worker DateTime objects are serialized as double.
> To reproduce:
> {code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function(x) { return(x$date) })
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-10 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563944#comment-15563944
 ] 

Hossein Falaki commented on SPARK-17781:


Yes, but somehow inside {{worker.R}} the Date fields in the list are treated as 
doubles. A case in point is the reproducing example in the body of the ticket. 
To be honest, I am still confused about this too.

> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for dapply family functions, inside 
> the worker DateTime objects are serialized as double.
> To reproduce:
> {code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function(x) { return(x$date) })
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17811) SparkR cannot parallelize data.frame with NA or NULL in Date columns

2016-10-10 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563506#comment-15563506
 ] 

Hossein Falaki commented on SPARK-17811:


Thanks [~wm624] for looking into it. I submitted a small PR to fix the issue.

> SparkR cannot parallelize data.frame with NA or NULL in Date columns
> 
>
> Key: SPARK-17811
> URL: https://issues.apache.org/jira/browse/SPARK-17811
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> To reproduce: 
> {code}
> df <- data.frame(Date = as.Date(c(rep("2016-01-10", 10), "NA", "NA")), id = 
> 1:12)
> dim(createDataFrame(df))
> {code}
> We don't seem to have this problem with POSIXlt 
> {code}
> df <- data.frame(Date = as.POSIXlt(as.Date(c(rep("2016-01-10", 10), "NA", 
> "NA"))), id = 1:12)
> dim(createDataFrame(df))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-10 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563193#comment-15563193
 ] 

Hossein Falaki commented on SPARK-17781:


I investigated the issue. The root cause is that Date (and Timestamp) types are 
converted to their underlying representations when they are put in a list. To 
see it, run the following simple test in an R REPL:

{code}
> l <- lapply(1:2, function(x) { Sys.Date() })
> print(paste("list values", l))
[1] "list values 17084" "list values 17084"
{code}

A similar problem happens with the POSIXlt and POSIXct types. Therefore, in 
{{worker.R}}, when we call {{computeFunc(inputData)}} we are dealing with a list 
that contains double values for the date fields. 

Right now it seems the safe way to work around it is to avoid Date and Time 
types and use String instead. [~shivaram] and [~felixcheung], do you have any 
ideas?
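
A minimal sketch of the string-based workaround (illustrative only; the column 
names and schema are made up for the example):

{code}
# Ship the date as a character column, then restore the Date type on the driver.
df <- createDataFrame(data.frame(id = 1:10,
                                 date = as.character(Sys.Date()),
                                 stringsAsFactors = FALSE))
collected <- dapplyCollect(df, function(x) {
  # Inside the worker the values are plain strings, which survive the
  # list round trip unchanged.
  data.frame(res = x$date, stringsAsFactors = FALSE)
})
collected$res <- as.Date(collected$res)
{code}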

> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for dapply family functions, inside 
> the worker DateTime objects are serialized as double.
> To reproduce:
> {code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function(x) { return(x$date) })
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17811) SparkR cannot parallelize data.frame with NA or NULL in Date columns

2016-10-06 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17811:
--

 Summary: SparkR cannot parallelize data.frame with NA or NULL in 
Date columns
 Key: SPARK-17811
 URL: https://issues.apache.org/jira/browse/SPARK-17811
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.1
Reporter: Hossein Falaki


To reproduce: 

{code}
df <- data.frame(Date = as.Date(c(rep("2016-01-10", 10), "NA", "NA")), id = 
1:12)

dim(createDataFrame(df))
{code}

We don't seem to have this problem with POSIXlt 
{code}
df <- data.frame(Date = as.POSIXlt(as.Date(c(rep("2016-01-10", 10), "NA", 
"NA"))), id = 1:12)
dim(createDataFrame(df))
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17774) Add support for head on DataFrame Column

2016-10-05 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15550488#comment-15550488
 ] 

Hossein Falaki commented on SPARK-17774:


I agree. I think if we decouple {{head}} from {{collect}} there is a better 
chance that your PR gets through. Would you please rebase it and, for now, just 
include {{head}}?

> Add support for head on DataFrame Column
> 
>
> Key: SPARK-17774
> URL: https://issues.apache.org/jira/browse/SPARK-17774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> There was a lot of discussion on SPARK-9325. To summarize the conversation on 
> that ticket regarding {{collect}}:
> * Pro: Ease of use and maximum compatibility with existing R API
> * Con: We do not want to increase maintenance cost by opening arbitrary API. 
> With Spark's DataFrame API {{collect}} does not work on {{Column}} and there 
> is no need for it to work in R.
> This ticket is strictly about {{head}}. I propose supporting {{head}} on 
> {{Column}} because:
> 1. R users are already used to calling {{head(iris$Sepal.Length)}}. When they 
> do that on SparkDataFrame they get an error. Not a good experience
> 2. Adding support for it does not require any change to the backend. It can 
> be trivially done in R code. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17790) Support for parallelizing data.frame larger than 2GB

2016-10-05 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17790:
---
Issue Type: Sub-task  (was: Story)
Parent: SPARK-6235

> Support for parallelizing data.frame larger than 2GB
> 
>
> Key: SPARK-17790
> URL: https://issues.apache.org/jira/browse/SPARK-17790
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> This issue is a more specific version of SPARK-17762. 
> Supporting larger than 2GB arguments is more general and arguably harder to 
> do because the limit exists both in R and JVM (because we receive data as a 
> ByteArray). However, to support parallelizing R data.frames that are larger 
> than 2GB we can do what PySpark does.
> PySpark uses files to transfer bulk data between Python and JVM. It has 
> worked well for the large community of Spark Python users. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17790) Support for parallelizing R data.frame larger than 2GB

2016-10-05 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17790:
---
Summary: Support for parallelizing R data.frame larger than 2GB  (was: 
Support for parallelizing data.frame larger than 2GB)

> Support for parallelizing R data.frame larger than 2GB
> --
>
> Key: SPARK-17790
> URL: https://issues.apache.org/jira/browse/SPARK-17790
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> This issue is a more specific version of SPARK-17762. 
> Supporting larger than 2GB arguments is more general and arguably harder to 
> do because the limit exists both in R and JVM (because we receive data as a 
> ByteArray). However, to support parallelizing R data.frames that are larger 
> than 2GB we can do what PySpark does.
> PySpark uses files to transfer bulk data between Python and JVM. It has 
> worked well for the large community of Spark Python users. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17790) Support for parallelizing data.frame larger than 2GB

2016-10-05 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549996#comment-15549996
 ] 

Hossein Falaki commented on SPARK-17790:


Thanks for pointing it out. SPARK-6235 seems to be an umbrella ticket. This one 
can be a subtask of it.

> Support for parallelizing data.frame larger than 2GB
> 
>
> Key: SPARK-17790
> URL: https://issues.apache.org/jira/browse/SPARK-17790
> Project: Spark
>  Issue Type: Story
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> This issue is a more specific version of SPARK-17762. 
> Supporting larger than 2GB arguments is more general and arguably harder to 
> do because the limit exists both in R and JVM (because we receive data as a 
> ByteArray). However, to support parallelizing R data.frames that are larger 
> than 2GB we can do what PySpark does.
> PySpark uses files to transfer bulk data between Python and JVM. It has 
> worked well for the large community of Spark Python users. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17790) Support for parallelizing data.frame larger than 2GB

2016-10-05 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17790:
---
Summary: Support for parallelizing data.frame larger than 2GB  (was: 
Support for parallelizing/creating DataFrame on data larger than 2GB)

> Support for parallelizing data.frame larger than 2GB
> 
>
> Key: SPARK-17790
> URL: https://issues.apache.org/jira/browse/SPARK-17790
> Project: Spark
>  Issue Type: Story
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> This issue is a more specific version of SPARK-17762. 
> Supporting larger than 2GB arguments is more general and arguably harder to 
> do because the limit exists both in R and JVM (because we receive data as a 
> ByteArray). However, to support parallelizing R data.frames that are larger 
> than 2GB we can do what PySpark does.
> PySpark uses files to transfer bulk data between Python and JVM. It has 
> worked well for the large community of Spark Python users. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17790) Support for parallelizing/creating DataFrame on data larger than 2GB

2016-10-05 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549932#comment-15549932
 ] 

Hossein Falaki commented on SPARK-17790:


[~shivaram] and [~mengxr], just double-checking: in all supported SparkR 
deployment modes, are the driver R process and the JVM on the same machine?

> Support for parallelizing/creating DataFrame on data larger than 2GB
> 
>
> Key: SPARK-17790
> URL: https://issues.apache.org/jira/browse/SPARK-17790
> Project: Spark
>  Issue Type: Story
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> This issue is a more specific version of SPARK-17762. 
> Supporting larger than 2GB arguments is more general and arguably harder to 
> do because the limit exists both in R and JVM (because we receive data as a 
> ByteArray). However, to support parallelizing R data.frames that are larger 
> than 2GB we can do what PySpark does.
> PySpark uses files to transfer bulk data between Python and JVM. It has 
> worked well for the large community of Spark Python users. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17790) Support for parallelizing/creating DataFrame on data larger than 2GB

2016-10-05 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17790:
--

 Summary: Support for parallelizing/creating DataFrame on data 
larger than 2GB
 Key: SPARK-17790
 URL: https://issues.apache.org/jira/browse/SPARK-17790
 Project: Spark
  Issue Type: Story
  Components: SparkR
Affects Versions: 2.0.1
Reporter: Hossein Falaki


This issue is a more specific version of SPARK-17762. 
Supporting larger than 2GB arguments is more general and arguably harder to do 
because the limit exists both in R and JVM (because we receive data as a 
ByteArray). However, to support parallelizing R data.frames that are larger 
than 2GB we can do what PySpark does.

PySpark uses files to transfer bulk data between Python and JVM. It has worked 
well for the large community of Spark Python users. 
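
A rough sketch of the file-based hand-off, with hypothetical names (SparkR's 
actual serialization path differs, and very large payloads would also need the 
chunked writes discussed in SPARK-17762):

{code}
# Spill the serialized payload to a temp file and send only the path to the
# JVM, instead of pushing one giant ByteArray through the socket.
writePayloadToTempFile <- function(payload) {  # payload: a raw vector
  path <- tempfile(pattern = "sparkr-batch-", fileext = ".bin")
  con <- file(path, open = "wb")
  on.exit(close(con))
  writeBin(payload, con)
  path  # the JVM side reads the file and deletes it when done
}
{code}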



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17774) Add support for head on DataFrame Column

2016-10-05 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549913#comment-15549913
 ] 

Hossein Falaki commented on SPARK-17774:


I strongly feel {{head}} should work, but I don't have as strong an opinion 
about {{collect}}, although it would be nice if that worked too.

On returning an error vs. an empty vector, I think returning an error is safer 
because it will not cause silent problems.

> Add support for head on DataFrame Column
> 
>
> Key: SPARK-17774
> URL: https://issues.apache.org/jira/browse/SPARK-17774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> There was a lot of discussion on SPARK-9325. To summarize the conversation on 
> that ticket regarding {{collect}}:
> * Pro: Ease of use and maximum compatibility with existing R API
> * Con: We do not want to increase maintenance cost by opening arbitrary API. 
> With Spark's DataFrame API {{collect}} does not work on {{Column}} and there 
> is no need for it to work in R.
> This ticket is strictly about {{head}}. I propose supporting {{head}} on 
> {{Column}} because:
> 1. R users are already used to calling {{head(iris$Sepal.Length)}}. When they 
> do that on SparkDataFrame they get an error. Not a good experience
> 2. Adding support for it does not require any change to the backend. It can 
> be trivially done in R code. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-04 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17781:
---
Description: 
When we ship a SparkDataFrame to workers for dapply family functions, inside 
the worker DateTime objects are serialized as double.

To reproduce:
{code}
df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
dapplyCollect(df, function(x) { return(x$date) })
{code}

  was:
When we ship a SparkDataFrame to workers for dapply family functions, inside 
the worker DateTime objects are serialized as double.

To reproduce:
{{code}}
df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
dapplyCollect(df, function( x ) {
  return(x$date)
})
{{code}}


> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for dapply family functions, inside 
> the worker DateTime objects are serialized as double.
> To reproduce:
> {code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function(x) { return(x$date) })
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-04 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17781:
---
Affects Version/s: 2.0.0

> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for dapply family functions, inside 
> the worker DateTime objects are serialized as double.
> To reproduce:
> {code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function(x) { return(x$date) })
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-04 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17781:
---
Affects Version/s: (was: 2.0.1)

> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for dapply family functions, inside 
> the worker DateTime objects are serialized as double.
> To reproduce:
> {code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function(x) { return(x$date) })
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-04 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17781:
---
Description: 
When we ship a SparkDataFrame to workers for dapply family functions, inside 
the worker DateTime objects are serialized as double.

To reproduce:
{{code}}
df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
dapplyCollect(df, function( x ) { return(x$date) })
{{code}}

  was:
When we ship a SparkDataFrame to workers for dapply family functions, inside 
the worker DateTime objects are serialized as double.

To reproduce:
{{code}}
df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
dapplyCollect(df, function(x) { return(x$date) })
{{code}}


> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for dapply family functions, inside 
> the worker DateTime objects are serialized as double.
> To reproduce:
> {{code}}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function( x ) { return(x$date) })
> {{code}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-04 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17781:
---
Description: 
When we ship a SparkDataFrame to workers for dapply family functions, inside 
the worker DateTime objects are serialized as double.

To reproduce:
{{code}}
df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
dapplyCollect(df, function( x ) {
  return(x$date)
})
{{code}}

  was:
When we ship a SparkDataFrame to workers for dapply family functions, inside 
the worker DateTime objects are serialized as double.

To reproduce:
{{code}}
df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
dapplyCollect(df, function( x ) { return(x$date) })
{{code}}


> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for dapply family functions, inside 
> the worker DateTime objects are serialized as double.
> To reproduce:
> {{code}}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function( x ) {
>   return(x$date)
> })
> {{code}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17774) Add support for head on DataFrame Column

2016-10-04 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547232#comment-15547232
 ] 

Hossein Falaki commented on SPARK-17774:


Putting implementation aside, throwing an error for {{head(df$col)}} is a bad 
user experience. For the corner cases where the user calls {{head}} on a column 
that does not belong to any DataFrame, we can throw an appropriate error. Before 
I submit a PR, I would like to get consensus here.

Also, I think {{head}} is more important than {{collect}}, because {{collect}} 
is not an existing R function.

CC [~marmbrus] [~sunrui] [~felixcheung] [~olarayej]

> Add support for head on DataFrame Column
> 
>
> Key: SPARK-17774
> URL: https://issues.apache.org/jira/browse/SPARK-17774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> There was a lot of discussion on SPARK-9325. To summarize the conversation on 
> that ticket regarding {{collect}}:
> * Pro: Ease of use and maximum compatibility with existing R API
> * Con: We do not want to increase maintenance cost by opening arbitrary API. 
> With Spark's DataFrame API {{collect}} does not work on {{Column}} and there 
> is no need for it to work in R.
> This ticket is strictly about {{head}}. I propose supporting {{head}} on 
> {{Column}} because:
> 1. R users are already used to calling {{head(iris$Sepal.Length)}}. When they 
> do that on SparkDataFrame they get an error. Not a good experience
> 2. Adding support for it does not require any change to the backend. It can 
> be trivially done in R code. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-04 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17781:
--

 Summary: datetime is serialized as double inside dapply()
 Key: SPARK-17781
 URL: https://issues.apache.org/jira/browse/SPARK-17781
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.1
Reporter: Hossein Falaki


When we ship a SparkDataFrame to workers for dapply family functions, inside 
the worker DateTime objects are serialized as double.

To reproduce:
{{code}}
df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
dapplyCollect(df, function(x) { return(x$date) })
{{code}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17774) Add support for head on DataFrame Column

2016-10-03 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17774:
---
Description: 
There was a lot of discussion on SPARK-9325. To summarize the conversation on 
that ticket regarding {{collect}}:
* Pro: Ease of use and maximum compatibility with existing R API
* Con: We do not want to increase maintenance cost by opening arbitrary API. 
With Spark's DataFrame API {{collect}} does not work on {{Column}} and there is 
no need for it to work in R.

This ticket is strictly about {{head}}. I propose supporting {{head}} on 
{{Column}} because:
1. R users are already used to calling {{head(iris$Sepal.Length)}}. When they 
do that on SparkDataFrame they get an error. Not a good experience
2. Adding support for it does not require any change to the backend. It can be 
trivially done in R code. 

  was:
There was a lot of discussion on SPARK-9325. To summarize the conversation on 
that ticket regarding {{collect}}:
* Pro: Ease of use and maximum compatibility with existing R API
* Con: We do not want to increase maintenance cost by opening arbitrary API. 
With Spark's DataFrame API {{collect}} does not work on {{Column}} and there is 
no need for it to work in R.

This ticket is strictly about {{head}}. I propose suporting {{head}} on 
{{Column}} because:
1. R users are already used to calling {{head(iris$Sepal.Length}}. When they do 
that on SparkDataFrame they get an error. Not a good experience
2. Adding support for it does not require any change to the backend. It can be 
trivially done in R code. 


> Add support for head on DataFrame Column
> 
>
> Key: SPARK-17774
> URL: https://issues.apache.org/jira/browse/SPARK-17774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> There was a lot of discussion on SPARK-9325. To summarize the conversation on 
> that ticket regarding {{collect}}:
> * Pro: Ease of use and maximum compatibility with existing R API
> * Con: We do not want to increase maintenance cost by opening arbitrary API. 
> With Spark's DataFrame API {{collect}} does not work on {{Column}} and there 
> is no need for it to work in R.
> This ticket is strictly about {{head}}. I propose supporting {{head}} on 
> {{Column}} because:
> 1. R users are already used to calling {{head(iris$Sepal.Length)}}. When they 
> do that on SparkDataFrame they get an error. Not a good experience
> 2. Adding support for it does not require any change to the backend. It can 
> be trivially done in R code. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17774) Add support for head on DataFrame Column

2016-10-03 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17774:
---
Description: 
There was a lot of discussion on SPARK-9325. To summarize the conversation on 
that ticket regarding {{collect}}:
* Pro: Ease of use and maximum compatibility with existing R API
* Con: We do not want to increase maintenance cost by opening arbitrary API. 
With Spark's DataFrame API {{collect}} does not work on {{Column}} and there is 
no need for it to work in R.

This ticket is strictly about {{head}}. I propose suporting {{head}} on 
{{Column}} because:
1. R users are already used to calling {{head(iris$Sepal.Length}}. When they do 
that on SparkDataFrame they get an error. Not a good experience
2. Adding support for it does not require any change to the backend. It can be 
trivially done in R code. 

> Add support for head on DataFrame Column
> 
>
> Key: SPARK-17774
> URL: https://issues.apache.org/jira/browse/SPARK-17774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> There was a lot of discussion on SPARK-9325. To summarize the conversation on 
> that ticket regarding {{collect}}:
> * Pro: Ease of use and maximum compatibility with existing R API
> * Con: We do not want to increase maintenance cost by opening arbitrary API. 
> With Spark's DataFrame API {{collect}} does not work on {{Column}} and there 
> is no need for it to work in R.
> This ticket is strictly about {{head}}. I propose suporting {{head}} on 
> {{Column}} because:
> 1. R users are already used to calling {{head(iris$Sepal.Length}}. When they 
> do that on SparkDataFrame they get an error. Not a good experience
> 2. Adding support for it does not require any change to the backend. It can 
> be trivially done in R code. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17774) Add support for head on DataFrame Column

2016-10-03 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17774:
--

 Summary: Add support for head on DataFrame Column
 Key: SPARK-17774
 URL: https://issues.apache.org/jira/browse/SPARK-17774
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 2.0.0
Reporter: Hossein Falaki






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17762) invokeJava fails when serialized argument list is larger than INT_MAX (2,147,483,647) bytes

2016-10-02 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17762:
---
Summary: invokeJava fails when serialized argument list is larger than 
INT_MAX (2,147,483,647) bytes  (was: invokeJava serialized argument list is 
larger than INT_MAX (2,147,483,647) bytes)

> invokeJava fails when serialized argument list is larger than INT_MAX 
> (2,147,483,647) bytes
> ---
>
> Key: SPARK-17762
> URL: https://issues.apache.org/jira/browse/SPARK-17762
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> We call {{writeBin}} within {{writeRaw}} which is called from invokeJava on 
> the serialized arguments list. Unfortunately, {{writeBin}} has a hard-coded 
> limit set to {{R_LEN_T_MAX}} (which is itself set to {{INT_MAX}} in base). 
> To work around it, we can check for this case and serialize the batch in 
> multiple parts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17762) invokeJava serialized argument list is larger than INT_MAX (2,147,483,647) bytes

2016-10-02 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17762:
--

 Summary: invokeJava serialized argument list is larger than 
INT_MAX (2,147,483,647) bytes
 Key: SPARK-17762
 URL: https://issues.apache.org/jira/browse/SPARK-17762
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.0
Reporter: Hossein Falaki


We call {{writeBin}} within {{writeRaw}} which is called from invokeJava on the 
serialized arguments list. Unfortunately, {{writeBin}} has a hard-coded limit 
set to {{R_LEN_T_MAX}} (which is itself set to {{INT_MAX}} in base). 

To work around it, we can check for this case and serialize the batch in 
multiple parts.
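
A minimal sketch of the chunked write (illustrative only, with a hypothetical 
helper name; the real {{writeRaw}} has additional bookkeeping):

{code}
# writeBin() refuses to write more than INT_MAX bytes in one call, so write
# the raw payload in fixed-size slices instead.
writeRawChunked <- function(con, payload, chunkSize = 1e9) {
  total <- length(payload)   # may exceed INT_MAX for long raw vectors
  offset <- 0
  while (offset < total) {
    n <- min(chunkSize, total - offset)
    writeBin(payload[(offset + 1):(offset + n)], con)
    offset <- offset + n
  }
}
{code}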



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17442) Additional arguments in write.df are not passed to data source

2016-09-07 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17442:
---
Target Version/s: 2.0.1

> Additional arguments in write.df are not passed to data source
> --
>
> Key: SPARK-17442
> URL: https://issues.apache.org/jira/browse/SPARK-17442
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>Priority: Blocker
>
> {{write.df}} passes everything in its arguments to underlying data source in 
> 1.x, but it is not passing header = "true" in Spark 2.0. For example the 
> following code snippet produces a header line in older versions of Spark but 
> not in 2.0.
> {code}
> df <- createDataFrame(iris)
> write.df(df, source = "com.databricks.spark.csv", path = "/tmp/iris", header 
> = "true")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17442) Additional arguments in write.df are not passed to data source

2016-09-07 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17442:
---
Priority: Blocker  (was: Critical)

> Additional arguments in write.df are not passed to data source
> --
>
> Key: SPARK-17442
> URL: https://issues.apache.org/jira/browse/SPARK-17442
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>Priority: Blocker
>
> {{write.df}} passes everything in its arguments to underlying data source in 
> 1.x, but it is not passing header = "true" in Spark 2.0. For example the 
> following code snippet produces a header line in older versions of Spark but 
> not in 2.0.
> {code}
> df <- createDataFrame(iris)
> write.df(df, source = "com.databricks.spark.csv", path = "/tmp/iris", header 
> = "true")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17442) Additional arguments in write.df are not passed to data source

2016-09-07 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17442:
---
Priority: Critical  (was: Major)

> Additional arguments in write.df are not passed to data source
> --
>
> Key: SPARK-17442
> URL: https://issues.apache.org/jira/browse/SPARK-17442
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>Priority: Critical
>
> {{write.df}} passes everything in its arguments to underlying data source in 
> 1.x, but it is not passing header = "true" in Spark 2.0. For example the 
> following code snippet produces a header line in older versions of Spark but 
> not in 2.0.
> {code}
> df <- createDataFrame(iris)
> write.df(df, source = "com.databricks.spark.csv", path = "/tmp/iris", header 
> = "true")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17442) Additional arguments in write.df are not passed to data source

2016-09-07 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17442:
--

 Summary: Additional arguments in write.df are not passed to data 
source
 Key: SPARK-17442
 URL: https://issues.apache.org/jira/browse/SPARK-17442
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.0
Reporter: Hossein Falaki


{{write.df}} passes everything in its arguments to underlying data source in 
1.x, but it is not passing header = "true" in Spark 2.0. For example the 
following code snippet produces a header line in older versions of Spark but 
not in 2.0.

{code}
df <- createDataFrame(iris)
write.df(df, source = "com.databricks.spark.csv", path = "/tmp/iris", header = 
"true")
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16883) SQL decimal type is not properly cast to number when collecting SparkDataFrame

2016-08-05 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410105#comment-15410105
 ] 

Hossein Falaki commented on SPARK-16883:


Thanks [~shivaram]! This may require changing the serialization. Who would have 
thought this might happen! ;) 

> SQL decimal type is not properly cast to number when collecting SparkDataFrame
> --
>
> Key: SPARK-16883
> URL: https://issues.apache.org/jira/browse/SPARK-16883
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> To reproduce, run the following code. As you can see, "y" is a list of values.
> {code}
> registerTempTable(createDataFrame(iris), "iris")
> str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y  
> from iris limit 5")))
> 'data.frame': 5 obs. of  2 variables:
>  $ x: num  1 1 1 1 1
>  $ y:List of 5
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16896) Loading csv with duplicate column names

2016-08-04 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408728#comment-15408728
 ] 

Hossein Falaki commented on SPARK-16896:


I suggest we generally follow the restrictions of SparkSQL, which does not 
accept duplicate column names. So I agree with numbering column names if there 
are duplicates.
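
For reference, a minimal base-R sketch of that numbering behaviour (plain 
{{read.csv}}, not Spark):

{code}
# read.csv() deduplicates header names via make.names(..., unique = TRUE),
# so a repeated column comes back with a numeric suffix.
path <- tempfile(fileext = ".csv")
writeLines(c("id,id", "1,2"), path)
names(read.csv(path))   # "id" "id.1"
{code}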

> Loading csv with duplicate column names
> ---
>
> Key: SPARK-16896
> URL: https://issues.apache.org/jira/browse/SPARK-16896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
>
> It would be great if the library allowed us to load CSV files with duplicate 
> column names. I understand that having duplicate columns in the data is odd, 
> but sometimes we get upstream data like that. We may choose to ignore such 
> columns, but currently there is no way to drop them because we cannot load the 
> file at all. As a pre-processing step I loaded the data into R, changed the 
> column names, and then made a fixed version that the Spark Java API can work 
> with.
> As for other options: R's read.csv automatically takes care of such a 
> situation by appending a number to the column name.
> Case sensitivity in column names can also cause problems. If we have columns 
> like
> ColumnName, columnName
> I may want to keep them separate, but the option to do this is not 
> documented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16903) nullValue in first field is not respected by CSV source when read

2016-08-04 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408722#comment-15408722
 ] 

Hossein Falaki commented on SPARK-16903:


Thanks for the info. That makes me doubt the decision to single out StringType. 
What is your opinion?

> nullValue in first field is not respected by CSV source when read
> -
>
> Key: SPARK-16903
> URL: https://issues.apache.org/jira/browse/SPARK-16903
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> file:
> {code}
> a,-
> -,10
> {code}
> Query:
> {code}
> create temporary table test(key string, val decimal) 
> using com.databricks.spark.csv 
> options (path "/tmp/hossein2/null.csv", header "false", delimiter ",", 
> nullValue "-");
> {code}
> Result:
> {code}
> select count(*) from test where key is null
> 0
> {code}
> But
> {code}
> select count(*) from test where val is null
> 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16883) SQL decimal type is not properly cast to number when collecting SparkDataFrame

2016-08-04 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408621#comment-15408621
 ] 

Hossein Falaki commented on SPARK-16883:


I think that is because we are not converting {{decimal}} to a {{numeric}} 
properly. In {{pkg/R/types.R}} decimal is supposed to translate to numeric.
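
Until that is fixed, a user-side workaround is to flatten the collected list 
column (a sketch, assuming the temp table from the reproduction above has been 
registered):

{code}
res <- collect(sql(paste(
  "select cast('1' as double) as x, cast('2' as decimal) as y",
  "from iris limit 5")))
# Flatten the list column back into a plain numeric vector.
res$y <- as.numeric(unlist(res$y))
str(res)
{code}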

> SQL decimal type is not properly cast to number when collecting SparkDataFrame
> --
>
> Key: SPARK-16883
> URL: https://issues.apache.org/jira/browse/SPARK-16883
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> To reproduce, run the following code. As you can see, "y" is a list of values.
> {code}
> registerTempTable(createDataFrame(iris), "iris")
> str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y  
> from iris limit 5")))
> 'data.frame': 5 obs. of  2 variables:
>  $ x: num  1 1 1 1 1
>  $ y:List of 5
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16903) nullValue in first field is not respected by CSV source when read

2016-08-04 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-16903:
--

 Summary: nullValue in first field is not respected by CSV source 
when read
 Key: SPARK-16903
 URL: https://issues.apache.org/jira/browse/SPARK-16903
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Hossein Falaki


file:
{code}
a,-
-,10
{code}

Query:
{code}
create temporary table test(key string, val decimal) 
using com.databricks.spark.csv 
options (path "/tmp/hossein2/null.csv", header "false", delimiter ",", 
nullValue "-");
{code}

Result:
{code}
select count(*) from test where key is null
0
{code}
But
{code}
select count(*) from test where val is null
1
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


