[jira] [Commented] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow
[ https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023715#comment-17023715 ] Hossein Falaki commented on SPARK-30629: Yes, this is a good example. There must be a solution that avoids the old bug and does not fail on such a case. I could not find one. > cleanClosure on recursive call leads to node stack overflow > --- > > Key: SPARK-30629 > URL: https://issues.apache.org/jira/browse/SPARK-30629 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > This problem surfaced while handling SPARK-22817. In theory there are tests that cover this problem, but it seems they have been dead for some reason. > Reproducible example > {code:r} > f <- function(x) { > f(x) > } > SparkR:::cleanClosure(f) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow
[ https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023712#comment-17023712 ] Hossein Falaki commented on SPARK-30629: Oh yes. I think we can disable that specific test. If it ever ran and passed it must have been because of the other bug. Thanks for catching it.
[jira] [Commented] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow
[ https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023710#comment-17023710 ] Hossein Falaki commented on SPARK-30629: [~zero323] that infinite recursive call will never terminate. So it is better to fail during {{cleanClosure()}} rather than during execution on the workers.
[jira] [Updated] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function
[ https://issues.apache.org/jira/browse/SPARK-29777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-29777: --- Description: Following code block reproduces the issue: {code} df <- createDataFrame(data.frame(x=1)) f1 <- function(x) x + 1 f2 <- function(x) f1(x) + 2 dapplyCollect(df, function(x) { f1(x); f2(x) }) {code} We get following error message: {code} org.apache.spark.SparkException: R computation failed with Error in f1(x) : could not find function "f1" Calls: compute -> computeFunc -> f2 {code} Compare that to this code block that succeeds: {code} dapplyCollect(df, function(x) { f2(x) }) {code} was: Following code block reproduces the issue: {code:java} library(SparkR) sparkR.session() spark_df <- createDataFrame(na.omit(airquality)) cody_local2 <- function(param2) { 10 + param2 } cody_local1 <- function(param1) { cody_local2(param1) } result <- cody_local2(5) calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) cody_local1(5) }) print(result) {code} We get following error message: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R computation failed with Error in cody_local2(param1) : could not find function "cody_local2" Calls: compute -> computeFunc -> cody_local1 {code} Compare that to this code block that succeeds: {code:java} calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) #cody_local1(5) }) {code}
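The failure above can be reproduced outside Spark by simulating what a worker sees: only the names the closure cleaner captured travel with the function. The sketch below is hypothetical Python, not SparkR's serialization path; `run_on_worker` simply rebinds a function's globals to the captured set, which reproduces the "could not find function" error when a transitive dependency (`f2`'s use of `f1`) is not captured.

```python
import types

def run_on_worker(func, captured):
    # Rebuild the function against a restricted globals dict, simulating a
    # worker that only sees what the closure cleaner shipped with the task.
    worker_globals = dict(captured)
    shipped = types.FunctionType(func.__code__, worker_globals, func.__name__)
    return shipped(1)

def f1(x):
    return x + 1

def f2(x):
    return f1(x) + 2

# run_on_worker(f2, {}) raises NameError: 'f1' is missing on the "worker",
# mirroring the reported failure. Capturing the transitive dependency fixes it:
result = run_on_worker(f2, {"f1": f1})  # f1(1) + 2 == 4
```

The point of the bug report is exactly this transitive step: capturing the user function is not enough; every function it references must be captured as well.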
[jira] [Updated] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function
[ https://issues.apache.org/jira/browse/SPARK-29777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-29777: --- Description: Following code block reproduces the issue: {code:java} library(SparkR) sparkR.session() spark_df <- createDataFrame(na.omit(airquality)) cody_local2 <- function(param2) { 10 + param2 } cody_local1 <- function(param1) { cody_local2(param1) } result <- cody_local2(5) calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) cody_local1(5) }) print(result) {code} We get following error message: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R computation failed with Error in cody_local2(param1) : could not find function "cody_local2" Calls: compute -> computeFunc -> cody_local1 {code} Compare that to this code block that succeeds: {code:java} calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) #cody_local1(5) }) {code} was: Following code block reproduces the issue: {code:java} library(SparkR) sparkR.session() spark_df <- createDataFrame(na.omit(airquality)) cody_local2 <- function(param2) { 10 + param2 } cody_local1 <- function(param1) { cody_local2(param1) } result <- cody_local2(5) calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) cody_local1(5) }) print(result) {code} We get following error message: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R computation failed with Error in cody_local2(param1) : could not find function "cody_local2" Calls: compute -> computeFunc -> cody_local1 {code} Compare that to this code block with succeeds: {code:java} calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) #cody_local1(5) }) {code}
[jira] [Updated] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function
[ https://issues.apache.org/jira/browse/SPARK-29777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-29777: --- Description: Following code block reproduces the issue: {code:java} library(SparkR) sparkR.session() spark_df <- createDataFrame(na.omit(airquality)) cody_local2 <- function(param2) { 10 + param2 } cody_local1 <- function(param1) { cody_local2(param1) } result <- cody_local2(5) calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) cody_local1(5) }) print(result) {code} We get following error message: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R computation failed with Error in cody_local2(param1) : could not find function "cody_local2" Calls: compute -> computeFunc -> cody_local1 {code} Compare that to this code block with succeeds: {code:java} calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) #cody_local1(5) }) {code} was: Following code block reproduces the issue: {code} library(SparkR) spark_df <- createDataFrame(na.omit(airquality)) cody_local2 <- function(param2) { 10 + param2 } cody_local1 <- function(param1) { cody_local2(param1) } result <- cody_local2(5) calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) cody_local1(5) }) print(result) {code} We get following error message: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R computation failed with Error in cody_local2(param1) : could not find function "cody_local2" Calls: compute -> computeFunc -> cody_local1 {code} Compare that to this code block with succeeds: {code} calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) #cody_local1(5) }) {code}
[jira] [Created] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function
Hossein Falaki created SPARK-29777: -- Summary: SparkR::cleanClosure aggressively removes a function required by user function Key: SPARK-29777 URL: https://issues.apache.org/jira/browse/SPARK-29777 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.4.4 Reporter: Hossein Falaki Following code block reproduces the issue: {code} library(SparkR) spark_df <- createDataFrame(na.omit(airquality)) cody_local2 <- function(param2) { 10 + param2 } cody_local1 <- function(param1) { cody_local2(param1) } result <- cody_local2(5) calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) cody_local1(5) }) print(result) {code} We get following error message: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R computation failed with Error in cody_local2(param1) : could not find function "cody_local2" Calls: compute -> computeFunc -> cody_local1 {code} Compare that to this code block that succeeds: {code} calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) #cody_local1(5) }) {code}
[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513008#comment-16513008 ] Hossein Falaki commented on SPARK-24359: [~shivaram] I like that. > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines in R-v3.pdf, SparkML_ ML Pipelines in R.pdf > > > h1. Background and motivation > SparkR supports calling MLlib functionality with an [R-friendly API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. > Since Spark 1.5 the (new) SparkML API, which is based on [pipelines and parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o], > has matured significantly. It allows users to build and maintain complicated machine learning pipelines. A lot of this functionality is difficult to expose using the simple formula-based API in SparkR. > We propose a new R package, _SparkML_, to be distributed along with SparkR as part of Apache Spark. This new package will be built on top of SparkR’s APIs to expose SparkML’s pipeline APIs and functionality. > *Why not SparkR?* > The SparkR package contains ~300 functions. Many of these shadow functions in base and other popular CRAN packages. We think adding more functions to SparkR will degrade usability and make maintenance harder. > *Why not sparklyr?* > sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R users. sparklyr includes MLlib API wrappers, but to the best of our knowledge they are not comprehensive. Also, we propose a code-gen approach for this package to minimize the work needed to expose future MLlib APIs, whereas sparklyr’s API is manually written. > h1. Target Personas > * Existing SparkR users who need a more flexible SparkML API > * R users (data scientists, statisticians) who wish to build Spark ML pipelines in R > h1. Goals > * R users can install SparkML from CRAN > * R users will be able to import SparkML independently of SparkR > * After setting up a Spark session R users can > ** create a pipeline by chaining individual components and specifying their parameters > ** tune a pipeline in parallel, taking advantage of Spark > ** inspect a pipeline’s parameters and evaluation metrics > ** repeatedly apply a pipeline > * MLlib contributors can easily add R wrappers for new MLlib Estimators and Transformers > h1. Non-Goals > * Adding new algorithms to the SparkML R package which do not exist in Scala > * Parallelizing existing CRAN packages > * Changing the existing SparkR ML wrapping API > h1. Proposed API Changes > h2. Design goals > When encountering trade-offs in the API, we will choose based on the following list of priorities. The API choice that addresses a higher priority goal will be chosen. > * *Comprehensive coverage of MLlib API:* Design choices that make R coverage of future ML algorithms difficult will be ruled out. > * *Semantic clarity*: We attempt to minimize confusion with other packages. Between conciseness and clarity, we will choose clarity. > * *Maintainability and testability:* API choices that require manual maintenance or make testing difficult should be avoided. > * *Interoperability with the rest of Spark components:* We will keep the R API as thin as possible and keep all functionality implemented in JVM/Scala. > * *Being natural to R users:* The ultimate users of this package are R users and they should find it easy and natural to use. > The API will follow familiar R function syntax, where the object is passed as the first argument of the method: do_something(obj, arg1, arg2). All functions are snake_case (e.g., {{spark_logistic_regression()}} and {{set_max_iter()}}). If a constructor takes arguments, they will be named arguments. For example: > {code:java} > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code} > When calls need to be chained, as in the example above, the syntax translates nicely to a natural pipeline style with help from the very popular [magrittr package|https://cran.r-project.org/web/packages/magrittr/index.html]. For example: > {code:java} > logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code} > h2. Namespace > All new API will be under a new CRAN package, named SparkML. The package should be usable without needing SparkR in the namespace. The package will introduce a number of S4 classes that inherit from four basic classes. Here we will list the basic types with a
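The proposed chaining style maps onto any language with fluent setters. As a hypothetical illustration only (the class and parameter names below are made up to mirror the SPIP's snake_case examples, not an actual SparkML API), the same pipeline style falls out of setters that return the object itself:

```python
class LogisticRegression:
    """Stand-in for the proposed spark_logistic_regression() constructor."""
    def __init__(self):
        # Illustrative defaults; real MLlib defaults may differ.
        self.params = {"max_iter": 100, "reg_param": 0.0}

    def set_max_iter(self, n):
        self.params["max_iter"] = n
        return self  # returning self is what enables chaining

    def set_reg_param(self, r):
        self.params["reg_param"] = r
        return self

# Equivalent of: logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01)
lr = LogisticRegression().set_max_iter(10).set_reg_param(0.01)
```

In R, magrittr's `%>%` gives the same left-to-right reading without each setter needing the object as an explicit receiver, since the pipe supplies it as the first argument.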
[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512959#comment-16512959 ] Hossein Falaki commented on SPARK-24359: Considering that I am volunteering myself to do the housekeeping needed for any SparkML maintenance branches, I conclude that we are going to keep this as part of the main repository. I expect that we will submit to CRAN only when the community feels comfortable about the stability of the new package (following an alpha => beta => GA process).
[jira] [Comment Edited] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501254#comment-16501254 ] Hossein Falaki edited comment on SPARK-24359 at 6/5/18 3:51 AM: [~shivaram] what prevents us from creating a tag like SparkML-2.4.0.1 and SparkML-2.4.0.2 (or some other variant like that) in the main Spark repo? Also, if you think initially this will be unclear, we don't have to submit SparkML to CRAN in its first release. Similar to SparkR, we can wait a bit until we are confident about its compatibility. Many users and distributions distribute SparkR from Apache rather than CRAN (one example is Databricks; we build SparkR from source) and they would be able to give us feedback while the project is in alpha state. was (Author: falaki): [~shivaram] what prevents us from creating a tag like SparkML-2.4.0.1 and SparkML-2.4.0.2 (or some other variant like that) in the main Spark repo? Also, if you think initially this will be unclear, we don't have to submit SparkML to CRAN in its first release. Similar to SparkR we can wait a bit until we are confident about its compatibility. Many users and distributions, distribute SparkR from Apache rather than CRAN. One example is Databricks. We build SparkR from source.
[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501254#comment-16501254 ] Hossein Falaki commented on SPARK-24359: [~shivaram] what prevents us from creating a tag like SparkML-2.4.0.1 and SparkML-2.4.0.2 (or some other variant like that) in the main Spark repo? Also, if you think initially this will be unclear, we don't have to submit SparkML to CRAN in its first release. Similar to SparkR, we can wait a bit until we are confident about its compatibility. Many users and distributions distribute SparkR from Apache rather than CRAN. One example is Databricks. We build SparkR from source.
[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498842#comment-16498842 ] Hossein Falaki commented on SPARK-24359: Yes. My bad, I meant releasing an update to CRAN for every 2.x and 3.x release. However, if Spark does patch releases like 2.3.4, we are not required to push a new CRAN package, but we have the option to do so. I guess that is identical to the SparkR CRAN release cycle. > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines > in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-24359: --- Description: h1. Background and motivation SparkR supports calling MLlib functionality with an [R-friendly API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. Since Spark 1.5 the (new) SparkML API, which is based on [pipelines and parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o], has matured significantly. It allows users to build and maintain complicated machine learning pipelines. A lot of this functionality is difficult to expose using the simple formula-based API in SparkR. We propose a new R package, _SparkML_, to be distributed along with SparkR as part of Apache Spark. This new package will be built on top of SparkR’s APIs to expose SparkML’s pipeline APIs and functionality. *Why not SparkR?* The SparkR package contains ~300 functions. Many of these shadow functions in base and other popular CRAN packages. We think adding more functions to SparkR will degrade usability and make maintenance harder. *Why not sparklyr?* sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R users. sparklyr includes MLlib API wrappers, but to the best of our knowledge they are not comprehensive. Also, we propose a code-gen approach for this package to minimize the work needed to expose future MLlib APIs, but sparklyr’s API is manually written. h1. Target Personas * Existing SparkR users who need a more flexible SparkML API * R users (data scientists, statisticians) who wish to build Spark ML pipelines in R h1.
Goals * R users can install SparkML from CRAN * R users will be able to import SparkML independently of SparkR * After setting up a Spark session R users can ** create a pipeline by chaining individual components and specifying their parameters ** tune a pipeline in parallel, taking advantage of Spark ** inspect a pipeline’s parameters and evaluation metrics ** repeatedly apply a pipeline * MLlib contributors can easily add R wrappers for new MLlib Estimators and Transformers h1. Non-Goals * Adding new algorithms to the SparkML R package which do not exist in Scala * Parallelizing existing CRAN packages * Changing the existing SparkR ML wrapping API h1. Proposed API Changes h2. Design goals When encountering trade-offs in the API, we will choose based on the following list of priorities. The API choice that addresses a higher priority goal will be chosen. # *Comprehensive coverage of MLlib API:* Design choices that make R coverage of future ML algorithms difficult will be ruled out. * *Semantic clarity*: We attempt to minimize confusion with other packages. Between conciseness and clarity, we will choose clarity. * *Maintainability and testability:* API choices that require manual maintenance or make testing difficult should be avoided. * *Interoperability with the rest of Spark components:* We will keep the R API as thin as possible and keep all functionality implemented in JVM/Scala. * *Being natural to R users:* The ultimate users of this package are R users and they should find it easy and natural to use. The API will follow familiar R function syntax, where the object is passed as the first argument of the method: do_something(obj, arg1, arg2). All functions are snake_case (e.g., {{spark_logistic_regression()}} and {{set_max_iter()}}). If a constructor takes arguments, they will be named arguments.
For example: {code:r} > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code} When calls need to be chained, as in the example above, the syntax translates nicely to a natural pipeline style with help from the very popular [magrittr package|https://cran.r-project.org/web/packages/magrittr/index.html]. For example: {code:r} > spark_logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code} h2. Namespace All new APIs will be under a new CRAN package, named SparkML. The package should be usable without needing SparkR in the namespace. The package will introduce a number of S4 classes that inherit from four basic classes. Here we will list the basic types with a few examples. An object of any child class can be instantiated with a function call that starts with {{spark_}}. h2. Pipeline & PipelineStage A pipeline object contains one or more stages. {code:r} > pipeline <- spark_pipeline() %>% set_stages(stage1, stage2, stage3){code} where stage1, stage2, etc. are S4 objects of type PipelineStage and pipeline is an object of type Pipeline. h2. Transformers A Transformer is an algorithm that can transform one SparkDataFrame into another SparkDataFrame. *Example API:* {code:r} > tokenizer <- spark_tokenizer() %>% set_input_col("text") %>% set_output_col("words") > tokenized.df <- tokenizer %>% transform(df) {code} h2.
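The archived description is cut off just past the Transformer example. For readers skimming this archive, the same chaining style extends naturally to assembling and fitting a whole pipeline. The sketch below is hypothetical usage of the *proposed* API, not text from the SPIP; in particular, {{fit()}} and the data frame names are assumed here by analogy with the Scala/Python Pipeline API, and a real pipeline would need a feature-producing stage between the tokenizer and the classifier:

{code:r}
# Hypothetical end-to-end sketch in the proposed SparkML API:
# build a pipeline from stages, fit it, and apply the fitted model.
library(magrittr)

pipeline <- spark_pipeline() %>%
  set_stages(
    spark_tokenizer() %>% set_input_col("text") %>% set_output_col("words"),
    spark_logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01)
  )

model <- pipeline %>% fit(training_df)        # training_df: a SparkDataFrame (assumed)
predictions <- model %>% transform(test_df)   # test_df: a SparkDataFrame (assumed)
{code}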
[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-24359: --- Attachment: SparkML_ ML Pipelines in R-v3.pdf > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines > in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491907#comment-16491907 ] Hossein Falaki commented on SPARK-24359: Thank you [~josephkb] and [~felixcheung] for reviewing. As for a separate repo, since this is going to be just a new directory, I think we will contribute it to the Apache Spark repository, exactly for the reasons you mention. Where the code sits does not impact CRAN release management. As for CRAN releases, I think it is reasonable to release SparkML to CRAN with every major Spark release. Initially we will not release the package for minor releases and expect SparkML 2.4 to work with any 2.4.x release of Spark. To minimize the CRAN check burden, we will run all integration tests (those that interact with the JVM through {{SparkR.callJMethod()}}) on Spark Jenkins machines. BTW: I expect these tests will be minimal because we can unit-test the code generation logic to make sure it generates correct R code. [~felixcheung]: # {{set_input_col}} and {{set_output_col}} will accept any type that MLlib uses for {{setInputCol}} and {{setOutputCol}}. In this case we find that UnaryTransformer functions take column names, which are Strings. [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L86] # There are two possibilities for reading javadoc. I did not include more details in the design document because it seems like an implementation detail. ## Calling {{javadoc}} and then reading the generated HTML files. ## Compiling the documents into the jar (using annotations) and then reading them in the code generation tool. # Done. # Yes, {{training}} is a SparkDataFrame S4 object (which has been imported from {{SparkR}}) – the new package depends on {{SparkR}} and will be imported after SparkR. I updated the document and uploaded a new version.
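To make the code-generation point concrete, here is a minimal sketch of what one generated setter wrapper might look like. This is not from the SPIP: the S4 slot name {{jobj}} and the wrapper body are assumptions for illustration, and the JVM bridge used is the {{sparkR.callJMethod()}} function mentioned in the comment above:

{code:r}
# Hypothetical generated wrapper for Tokenizer.setInputCol.
# Assumes the stage S4 object keeps its Java object reference
# in a slot named `jobj` (naming is illustrative only).
set_input_col <- function(stage, value) {
  stopifnot(is.character(value), length(value) == 1L)
  # MLlib setters return `this`, so capture the result back into the slot.
  stage@jobj <- SparkR::sparkR.callJMethod(stage@jobj, "setInputCol", value)
  stage  # returning the stage keeps %>% chains working
}
{code}

Because every generated wrapper would follow the same shape, the code-generation logic itself can be unit-tested without a JVM, which is what keeps the CRAN check burden low.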
> SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines > in R.pdf
[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489944#comment-16489944 ] Hossein Falaki commented on SPARK-24359: Thank you all for the feedback. I updated the SPIP and the design document to use snake_case everywhere. I also added a section to the design document summarizing the CRAN release strategy. We can write integration tests that run on Jenkins to detect when we need to re-publish SparkML to CRAN. CRAN tests will not include any integration tests that interact with the JVM. > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines > in R.pdf
[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-24359: --- Description: h1. Background and motivation SparkR supports calling MLlib functionality with an [R-friendly API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. Since Spark 1.5 the (new) SparkML API, which is based on [pipelines and parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o], has matured significantly. It allows users to build and maintain complicated machine learning pipelines. A lot of this functionality is difficult to expose using the simple formula-based API in SparkR. We propose a new R package, _SparkML_, to be distributed along with SparkR as part of Apache Spark. This new package will be built on top of SparkR’s APIs to expose SparkML’s pipeline APIs and functionality. *Why not SparkR?* The SparkR package contains ~300 functions. Many of these shadow functions in base and other popular CRAN packages. We think adding more functions to SparkR will degrade usability and make maintenance harder. *Why not sparklyr?* sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R users. sparklyr includes MLlib API wrappers, but to the best of our knowledge they are not comprehensive. Also, we propose a code-gen approach for this package to minimize the work needed to expose future MLlib APIs, but sparklyr’s API is manually written. h1. Target Personas * Existing SparkR users who need a more flexible SparkML API * R users (data scientists, statisticians) who wish to build Spark ML pipelines in R h1.
Goals * R users can install SparkML from CRAN * R users will be able to import SparkML independently of SparkR * After setting up a Spark session R users can * create a pipeline by chaining individual components and specifying their parameters * tune a pipeline in parallel, taking advantage of Spark * inspect a pipeline’s parameters and evaluation metrics * repeatedly apply a pipeline * MLlib contributors can easily add R wrappers for new MLlib Estimators and Transformers h1. Non-Goals * Adding new algorithms to the SparkML R package which do not exist in Scala * Parallelizing existing CRAN packages * Changing the existing SparkR ML wrapping API h1. Proposed API Changes h2. Design goals When encountering trade-offs in the API, we will choose based on the following list of priorities. The API choice that addresses a higher priority goal will be chosen. # *Comprehensive coverage of MLlib API:* Design choices that make R coverage of future ML algorithms difficult will be ruled out. * *Semantic clarity*: We attempt to minimize confusion with other packages. Between conciseness and clarity, we will choose clarity. * *Maintainability and testability:* API choices that require manual maintenance or make testing difficult should be avoided. * *Interoperability with the rest of Spark components:* We will keep the R API as thin as possible and keep all functionality implemented in JVM/Scala. * *Being natural to R users:* The ultimate users of this package are R users and they should find it easy and natural to use. The API will follow familiar R function syntax, where the object is passed as the first argument of the method: do_something(obj, arg1, arg2). All functions are snake_case (e.g., {{spark_logistic_regression()}} and {{set_max_iter()}}). If a constructor takes arguments, they will be named arguments.
For example: {code:r} > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code} When calls need to be chained, as in the example above, the syntax translates nicely to a natural pipeline style with help from the very popular [magrittr package|https://cran.r-project.org/web/packages/magrittr/index.html]. For example: {code:r} > spark_logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code} h2. Namespace All new APIs will be under a new CRAN package, named SparkML. The package should be usable without needing SparkR in the namespace. The package will introduce a number of S4 classes that inherit from four basic classes. Here we will list the basic types with a few examples. An object of any child class can be instantiated with a function call that starts with {{spark_}}. h2. Pipeline & PipelineStage A pipeline object contains one or more stages. {code:r} > pipeline <- spark_pipeline() %>% set_stages(stage1, stage2, stage3){code} where stage1, stage2, etc. are S4 objects of type PipelineStage and pipeline is an object of type Pipeline. h2. Transformers A Transformer is an algorithm that can transform one SparkDataFrame into another SparkDataFrame. *Example API:* {code:r} > tokenizer <- spark_tokenizer() %>% set_input_col("text") %>% set_output_col("words") > tokenized.df <- tokenizer %>% transform(df) {code} h2.
[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-24359: --- Attachment: SparkML_ ML Pipelines in R-v2.pdf > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines > in R.pdf > > > h1. Background and motivation > SparkR supports calling MLlib functionality with an [R-friendly > API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. > Since Spark 1.5 the (new) SparkML API which is based on [pipelines and > parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o] > has matured significantly. It allows users build and maintain complicated > machine learning pipelines. A lot of this functionality is difficult to > expose using the simple formula-based API in SparkR. > We propose a new R package, _SparkML_, to be distributed along with SparkR as > part of Apache Spark. This new package will be built on top of SparkR’s APIs > to expose SparkML’s pipeline APIs and functionality. > *Why not SparkR?* > SparkR package contains ~300 functions. Many of these shadow functions in > base and other popular CRAN packages. We think adding more functions to > SparkR will degrade usability and make maintenance harder. > *Why not sparklyr?* > sparklyr is an R package developed by RStudio Inc. to expose Spark API to R > users. sparklyr includes MLlib API wrappers, but to the best of our knowledge > they are not comprehensive. Also we propose a code-gen approach for this > package to minimize work needed to expose future MLlib API, but sparkly’s API > is manually written. > h1. 
Target Personas > * Existing SparkR users who need more flexible SparkML API > * R users (data scientists, statisticians) who wish to build Spark ML > pipelines in R > h1. Goals > * R users can install SparkML from CRAN > * R users will be able to import SparkML independent from SparkR > * After setting up a Spark session R users can > * create a pipeline by chaining individual components and specifying their > parameters > * tune a pipeline in parallel, taking advantage of Spark > * inspect a pipeline’s parameters and evaluation metrics > * repeatedly apply a pipeline > * MLlib contributors can easily add R wrappers for new MLlib Estimators and > Transformers > h1. Non-Goals > * Adding new algorithms to SparkML R package which do not exist in Scala > * Parallelizing existing CRAN packages > * Changing existing SparkR ML wrapping API > h1. Proposed API Changes > h2. Design goals > When encountering trade-offs in API, we will chose based on the following > list of priorities. The API choice that addresses a higher priority goal will > be chosen. > # *Comprehensive coverage of MLlib API:* Design choices that make R coverage > of future ML algorithms difficult will be ruled out. > * *Semantic clarity*: We attempt to minimize confusion with other packages. > Between consciousness and clarity, we will choose clarity. > * *Maintainability and testability:* API choices that require manual > maintenance or make testing difficult should be avoided. > * *Interoperability with rest of Spark components:* We will keep the R API > as thin as possible and keep all functionality implementation in JVM/Scala. > * *Being natural to R users:* Ultimate users of this package are R users and > they should find it easy and natural to use. > The API will follow familiar R function syntax, where the object is passed as > the first argument of the method: do_something(obj, arg1, arg2). All > functions are snake_case (e.g., {{spark_logistic_regression()}} and > {{set_max_iter()}}). 
If a constructor gets arguments, they will be named arguments. For example: > {code:r} > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code} > When calls need to be chained, like the above example, the syntax translates nicely to a natural pipeline style with help from the very popular [magrittr package|https://cran.r-project.org/web/packages/magrittr/index.html]. For example: > {code:r} > spark_logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code} > h2. Namespace > All new APIs will be under a new CRAN package, named SparkML. The package should be usable without needing SparkR in the namespace. The package will introduce a number of S4 classes that inherit from four basic classes. Here we will list the basic types with a few examples. An object of any child class can be
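The equivalence between the nested call and the magrittr chain can be sketched in a few lines of plain R. The function names follow the proposal's style, but everything here is a toy stand-in: the list-backed estimator is hypothetical, and a minimal `%>%` is defined inline so the sketch runs without magrittr installed (the real package would delegate to the JVM).

```r
# Minimal stand-in for magrittr's %>%, enough for these examples:
# insert the left-hand side as the first argument of the right-hand call.
`%>%` <- function(lhs, rhs) {
  rhs_call <- substitute(rhs)
  eval(as.call(append(as.list(rhs_call), list(lhs), after = 1)), parent.frame())
}

# Toy estimator: just a named list of parameters (hypothetical, not SparkML).
spark_logistic_regression <- function() list(params = list())

set_max_iter  <- function(obj, value) { obj$params$max_iter  <- value; obj }
set_reg_param <- function(obj, value) { obj$params$reg_param <- value; obj }

# Nested form: the innermost call runs first.
lr1 <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1)

# Chained form: reads left to right and produces the same object.
lr2 <- spark_logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.1)

identical(lr1, lr2)  # TRUE
```

Each setter returns the (modified copy of the) object, which is what makes both the nested and the piped spellings compose.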
[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487815#comment-16487815 ] Hossein Falaki commented on SPARK-24359: Thanks [~shivaram] and [~zero323]. It seems CRAN release pain has left some scar tissue. We can host this new package in a separate repo and maintain it for a few release cycles to evaluate the CRAN release overhead. If the overhead is not too high, we can contribute it back to the main Spark repository. Alternatively, we can remove the requirement for co-releasing with Apache Spark and only release when there are API changes in the new package. As for the duplicate API, I see the issue as well. I think there is room for both styles (formula-based for simple use cases and pipeline-based for more complex programs). Based on feedback from the community, we can decide whether deprecating the old API makes sense down the road. I have received many requests from SparkR users for the ability to build pipelines (the same way Python and Scala support them). As for whether this new package will introduce Scala API changes: in my current prototype it is very minimal (and can be avoided). Almost all new Scala code is for the utility that generates R source code. The idea is that if a patch adds new API to MLlib, the contributor can simply execute a command-line tool and check in R wrappers for their new API. The goal of this work is to reduce the maintenance cost of the R API in Spark. > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR > Affects Versions: 3.0.0 > Reporter: Hossein Falaki > Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R.pdf > > > h1. Background and motivation > SparkR supports calling MLlib functionality with an [R-friendly API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. 
[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486799#comment-16486799 ] Hossein Falaki commented on SPARK-24359: Thanks for reviewing [~felixcheung]. # I wanted all discussions under this ticket rather than in a Google document. # The plan is to release a new SparkML R package with every new Apache Spark (and SparkR) release. I expect all new MLlib APIs to be exposed in the new package. # The SparkML R package will depend on SparkR and will use {{SparkR::sparkR.callJStatic()}} and {{SparkR::sparkR.callJMethod()}} for calling JVM functions. The package will import the {{SparkR::SparkDataFrame}} object from SparkR. # My proposed API style is {{spark.xyz()}} for object construction and {{set_xyz() / get_xyz()}} for setters and getters. If you think this will be confusing to users, I will update the design doc to stick to {{_}}. We should not have camel case in any API. S4 object names can match Spark class names (e.g., {{LogisticRegression}}); these are not exposed to users. Regarding the effort required for submitting and maintaining a package on CRAN, I am hoping to minimize the tests that interact with the JVM by: * Unit testing code generation in Spark * Relying on {{SparkR::sparkR.callJStatic()}} and {{SparkR::sparkR.callJMethod()}} and assuming that they are unit-tested in SparkR. Thanks for linking the original ticket. > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR > Affects Versions: 3.0.0 > Reporter: Hossein Falaki > Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R.pdf > > > h1. Background and motivation > SparkR supports calling MLlib functionality with an [R-friendly API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. 
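The code-generation idea discussed in the comment above can be illustrated with a small, hypothetical R sketch: given camelCase MLlib parameter names, mechanically derive snake_case `set_*`/`get_*` wrappers. The real generator would emit R source delegating to the JVM via {{SparkR::sparkR.callJMethod()}}; this toy version (all names and the list-backed storage are assumptions) just reads and writes a list.

```r
# Convert a camelCase MLlib parameter name (e.g. "maxIter") to the
# proposed snake_case accessor style ("max_iter").
snake_case <- function(name) {
  tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", name))
}

# Generate set_*/get_* closures for each parameter name. Toy version:
# values live in obj$params instead of a JVM-side Param map.
make_accessors <- function(param_names, env = parent.frame()) {
  for (p in param_names) {
    s <- snake_case(p)
    local({
      key <- p  # freeze the current name for this pair of closures
      assign(paste0("set_", s),
             function(obj, value) { obj$params[[key]] <- value; obj },
             envir = env)
      assign(paste0("get_", s),
             function(obj) obj$params[[key]],
             envir = env)
    })
  }
}

make_accessors(c("maxIter", "regParam"))
lr <- list(params = list())
lr <- set_reg_param(set_max_iter(lr, 10), 0.01)
get_max_iter(lr)   # 10
get_reg_param(lr)  # 0.01
```

This is the maintenance argument in miniature: adding a new MLlib parameter means adding one name to a list, not hand-writing two wrappers.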
[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-24359: --- Description: h1. Background and motivation SparkR supports calling MLlib functionality with an [R-friendly API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. Since Spark 1.5 the (new) SparkML API, which is based on [pipelines and parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o], has matured significantly. It allows users to build and maintain complicated machine learning pipelines. A lot of this functionality is difficult to expose using the simple formula-based API in SparkR. We propose a new R package, _SparkML_, to be distributed along with SparkR as part of Apache Spark. This new package will be built on top of SparkR’s APIs to expose SparkML’s pipeline APIs and functionality. *Why not SparkR?* The SparkR package contains ~300 functions. Many of these shadow functions in base and other popular CRAN packages. We think adding more functions to SparkR will degrade usability and make maintenance harder. *Why not sparklyr?* sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R users. sparklyr includes MLlib API wrappers, but to the best of our knowledge they are not comprehensive. Also, we propose a code-gen approach for this package to minimize the work needed to expose future MLlib APIs, whereas sparklyr’s API is manually written. h1. Target Personas * Existing SparkR users who need a more flexible SparkML API * R users (data scientists, statisticians) who wish to build Spark ML pipelines in R h1. 
Goals * R users can install SparkML from CRAN * R users will be able to import SparkML independently of SparkR * After setting up a Spark session, R users can ** create a pipeline by chaining individual components and specifying their parameters ** tune a pipeline in parallel, taking advantage of Spark ** inspect a pipeline’s parameters and evaluation metrics ** repeatedly apply a pipeline * MLlib contributors can easily add R wrappers for new MLlib Estimators and Transformers h1. Non-Goals * Adding new algorithms to the SparkML R package which do not exist in Scala * Parallelizing existing CRAN packages * Changing the existing SparkR ML wrapping API h1. Proposed API Changes h2. Design goals When encountering trade-offs in the API, we will choose based on the following list of priorities. The API choice that addresses a higher-priority goal will be chosen. # *Comprehensive coverage of MLlib API:* Design choices that make R coverage of future ML algorithms difficult will be ruled out. # *Semantic clarity:* We attempt to minimize confusion with other packages. Between conciseness and clarity, we will choose clarity. # *Maintainability and testability:* API choices that require manual maintenance or make testing difficult should be avoided. # *Interoperability with the rest of Spark components:* We will keep the R API as thin as possible and keep all functionality implemented in JVM/Scala. # *Being natural to R users:* The ultimate users of this package are R users and they should find it easy and natural to use. The API will follow familiar R function syntax, where the object is passed as the first argument of the method: do_something(obj, arg1, arg2). All constructors are dot separated (e.g., spark.logistic.regression()) and all setters and getters are snake case (e.g., set_max_iter()). If a constructor gets arguments, they will be named arguments. 
For example: {code:r} > lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1){code} When calls need to be chained, like the above example, the syntax translates nicely to a natural pipeline style with help from the very popular [magrittr package|https://cran.r-project.org/web/packages/magrittr/index.html]. For example: {code:r} > spark.logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code} h2. Namespace All new APIs will be under a new CRAN package, named SparkML. The package should be usable without needing SparkR in the namespace. The package will introduce a number of S4 classes that inherit from four basic classes. Here we will list the basic types with a few examples. An object of any child class can be instantiated with a function call that starts with spark. h2. Pipeline & PipelineStage A pipeline object contains one or more stages. {code:r} > pipeline <- spark.pipeline() %>% set_stages(stage1, stage2, stage3){code} where stage1, stage2, etc. are S4 objects of type PipelineStage and pipeline is an object of type Pipeline. h2. Transformers A Transformer is an algorithm that can transform one SparkDataFrame into another SparkDataFrame. *Example API:* {code:r} > tokenizer <- spark.tokenizer() %>% set_input_col("text") %>% set_output_col("words") > tokenized.df <-
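The Transformer call shape described above can be imitated on a local data.frame. Everything here is a toy stand-in under stated assumptions: the minimal `%>%` replaces magrittr, `transform_df` is a hypothetical helper, and the real spark.tokenizer() would wrap a JVM Tokenizer operating on a SparkDataFrame.

```r
# Minimal stand-in for magrittr's %>% so the sketch is self-contained.
`%>%` <- function(lhs, rhs) {
  rhs_call <- substitute(rhs)
  eval(as.call(append(as.list(rhs_call), list(lhs), after = 1)), parent.frame())
}

# Toy transformer object in the proposed style (hypothetical internals).
spark.tokenizer <- function() list(input_col = NULL, output_col = NULL)
set_input_col  <- function(t, col) { t$input_col  <- col; t }
set_output_col <- function(t, col) { t$output_col <- col; t }

# Toy transform: split the input column on spaces into a list column.
# (The real API would return a new SparkDataFrame.)
transform_df <- function(t, df) {
  df[[t$output_col]] <- strsplit(df[[t$input_col]], " ", fixed = TRUE)
  df
}

tokenizer <- spark.tokenizer() %>% set_input_col("text") %>% set_output_col("words")

df <- data.frame(text = c("spark ml pipelines", "hello world"),
                 stringsAsFactors = FALSE)
tokenized.df <- transform_df(tokenizer, df)
tokenized.df$words[[1]]  # c("spark", "ml", "pipelines")
```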
[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-24359: --- Description: h1. Background and motivation SparkR supports calling MLlib functionality with an [R-friendly API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. Since Spark 1.5 the (new) SparkML API, which is based on [pipelines and parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o], has matured significantly. It allows users to build and maintain complicated machine learning pipelines. A lot of this functionality is difficult to expose using the simple formula-based API in SparkR. We propose a new R package, _SparkML_, to be distributed along with SparkR as part of Apache Spark. This new package will be built on top of SparkR’s APIs to expose SparkML’s pipeline APIs and functionality. *Why not SparkR?* The SparkR package contains ~300 functions. Many of these shadow functions in base and other popular CRAN packages. We think adding more functions to SparkR will degrade usability and make maintenance harder. *Why not sparklyr?* sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R users. sparklyr includes MLlib API wrappers, but to the best of our knowledge they are not comprehensive. Also, we propose a code-gen approach for this package to minimize the work needed to expose future MLlib APIs, whereas sparklyr’s API is manually written. h1. Target Personas * Existing SparkR users who need a more flexible SparkML API * R users (data scientists, statisticians) who wish to build Spark ML pipelines in R h1. 
Goals * R users can install SparkML from CRAN * R users will be able to import SparkML independently of SparkR * After setting up a Spark session, R users can ** create a pipeline by chaining individual components and specifying their parameters ** tune a pipeline in parallel, taking advantage of Spark ** inspect a pipeline’s parameters and evaluation metrics ** repeatedly apply a pipeline * MLlib contributors can easily add R wrappers for new MLlib Estimators and Transformers h1. Non-Goals * Adding new algorithms to the SparkML R package which do not exist in Scala * Parallelizing existing CRAN packages * Changing the existing SparkR ML wrapping API h1. Proposed API Changes h2. Design goals When encountering trade-offs in the API, we will choose based on the following list of priorities. The API choice that addresses a higher-priority goal will be chosen. # *Comprehensive coverage of MLlib API:* Design choices that make R coverage of future ML algorithms difficult will be ruled out. # *Semantic clarity:* We attempt to minimize confusion with other packages. Between conciseness and clarity, we will choose clarity. # *Maintainability and testability:* API choices that require manual maintenance or make testing difficult should be avoided. # *Interoperability with the rest of Spark components:* We will keep the R API as thin as possible and keep all functionality implemented in JVM/Scala. # *Being natural to R users:* The ultimate users of this package are R users and they should find it easy and natural to use. The API will follow familiar R function syntax, where the object is passed as the first argument of the method: do_something(obj, arg1, arg2). All constructors are dot separated (e.g., spark.logistic.regression()) and all setters and getters are snake case (e.g., set_max_iter()). If a constructor gets arguments, they will be named arguments. 
For example: {code:r} > lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1){code} When calls need to be chained, like the above example, the syntax translates nicely to a natural pipeline style with help from the very popular [magrittr package|https://cran.r-project.org/web/packages/magrittr/index.html]. For example: {code:r} > spark.logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code} h2. Namespace All new APIs will be under a new CRAN package, named SparkML. The package should be usable without needing SparkR in the namespace. The package will introduce a number of S4 classes that inherit from four basic classes. Here we will list the basic types with a few examples. An object of any child class can be instantiated with a function call that starts with spark. h2. Pipeline & PipelineStage A pipeline object contains one or more stages. {code:r} > pipeline <- spark.pipeline() %>% set_stages(stage1, stage2, stage3){code} where stage1, stage2, etc. are S4 objects of type PipelineStage and pipeline is an object of type Pipeline. h2. Transformers A Transformer is an algorithm that can transform one SparkDataFrame into another SparkDataFrame.
[jira] [Created] (SPARK-24359) SPIP: ML Pipelines in R
Hossein Falaki created SPARK-24359: -- Summary: SPIP: ML Pipelines in R Key: SPARK-24359 URL: https://issues.apache.org/jira/browse/SPARK-24359 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 3.0.0 Reporter: Hossein Falaki Attachments: SparkML_ ML Pipelines in R.pdf h1. Background and motivation SparkR supports calling MLlib functionality with an [R-friendly API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. Since Spark 1.5 the (new) SparkML API, which is based on [pipelines and parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o], has matured significantly. It allows users to build and maintain complicated machine learning pipelines. A lot of this functionality is difficult to expose using the simple formula-based API in SparkR. We propose a new R package, _SparkML_, to be distributed along with SparkR as part of Apache Spark. This new package will be built on top of SparkR’s APIs to expose SparkML’s pipeline APIs and functionality. *Why not SparkR?* The SparkR package contains ~300 functions. Many of these shadow functions in base and other popular CRAN packages. We think adding more functions to SparkR will degrade usability and make maintenance harder. *Why not sparklyr?* sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R users. sparklyr includes MLlib API wrappers, but to the best of our knowledge they are not comprehensive. Also, we propose a code-gen approach for this package to minimize the work needed to expose future MLlib APIs, whereas sparklyr’s API is manually written. h1. Target Personas * Existing SparkR users who need a more flexible SparkML API * R users (data scientists, statisticians) who wish to build Spark ML pipelines in R h1. 
Goals * R users can install SparkML from CRAN * R users will be able to import SparkML independently of SparkR * After setting up a Spark session, R users can * create a pipeline by chaining individual components and specifying their parameters * tune a pipeline in parallel, taking advantage of Spark * inspect a pipeline's parameters and evaluation metrics * repeatedly apply a pipeline * MLlib contributors can easily add R wrappers for new MLlib Estimators and Transformers h1. Non-Goals * Adding new algorithms to the SparkML R package which do not exist in Scala * Parallelizing existing CRAN packages * Changing the existing SparkR ML wrapping API h1. Proposed API Changes h2. Design goals When encountering trade-offs in the API, we will choose based on the following list of priorities. The API choice that addresses a higher-priority goal will be chosen. # *Comprehensive coverage of MLlib API:* Design choices that make R coverage of future ML algorithms difficult will be ruled out. * *Semantic clarity:* We attempt to minimize confusion with other packages. Between conciseness and clarity, we will choose clarity. * *Maintainability and testability:* API choices that require manual maintenance or make testing difficult should be avoided. * *Interoperability with the rest of Spark components:* We will keep the R API as thin as possible and keep all functionality implemented in JVM/Scala. * *Being natural to R users:* The ultimate users of this package are R users, and they should find it easy and natural to use. The API will follow familiar R function syntax, where the object is passed as the first argument of the method: do_something(obj, arg1, arg2). All constructors are dot separated (e.g., spark.logistic.regression()) and all setters and getters are snake case (e.g., set_max_iter()). If a constructor takes arguments, they will be named arguments. 
For example: > lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1) When calls need to be chained, as in the above example, the syntax translates naturally to a pipeline style with help from the very popular [magrittr package|https://cran.r-project.org/web/packages/magrittr/index.html]. For example: > spark.logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr h2. Namespace All new APIs will be under a new CRAN package, named SparkML. The package should be usable without needing SparkR in the namespace. The package will introduce a number of S4 classes that inherit from four basic classes. Here we will list the basic types with a few examples. An object of any child class can be instantiated with a function call that starts with spark. h2. Pipeline & PipelineStage A pipeline object contains one or more stages. > pipeline <- spark.pipeline() %>% set_stages(stage1, stage2, stage3) Where stage1, stage2, etc
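The object-first constructor/setter convention above can be sketched in plain R without Spark. The names below are the hypothetical ones from the proposal, modeled as simple list-backed objects; the chained form uses R's native |> pipe (R >= 4.1) in place of magrittr, but the composition is identical.

```r
# Hypothetical stand-ins for the proposed API, with no Spark dependency:
# a constructor returning a parameter holder, and setters that take the
# object as the first argument and return the updated object.
spark.logistic.regression <- function() list(params = list())
set_max_iter  <- function(obj, value) { obj$params$max_iter  <- value; obj }
set_reg_param <- function(obj, value) { obj$params$reg_param <- value; obj }

# Nested-call style from the proposal:
lr_nested <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1)

# Pipeline style: because the object is always the first argument,
# the same calls chain with a pipe (native |> here, %>% with magrittr).
lr_piped <- spark.logistic.regression() |> set_max_iter(10) |> set_reg_param(0.1)

stopifnot(identical(lr_nested, lr_piped))
```

This is why the object-first signature is a design priority: it is exactly the shape both magrittr's %>% and the native pipe expect.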
[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R
[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-24359: --- Attachment: SparkML_ ML Pipelines in R.pdf > SPIP: ML Pipelines in R > --- > > Key: SPARK-24359 > URL: https://issues.apache.org/jira/browse/SPARK-24359 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hossein Falaki >Priority: Major > Labels: SPIP > Attachments: SparkML_ ML Pipelines in R.pdf > > > h1. Background and motivation > SparkR supports calling MLlib functionality with an [R-friendly > API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. > Since Spark 1.5 the (new) SparkML API, which is based on [pipelines and > parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o], > has matured significantly. It allows users to build and maintain complicated > machine learning pipelines. A lot of this functionality is difficult to > expose using the simple formula-based API in SparkR. > > We propose a new R package, _SparkML_, to be distributed along with SparkR as > part of > Apache Spark. This new package will be built on top of SparkR's APIs to > expose SparkML's pipeline APIs and functionality. > > *Why not SparkR?* > The SparkR package contains ~300 functions. Many of these shadow functions in > base and other popular CRAN packages. We think adding more functions to > SparkR will degrade usability and make maintenance harder. > > *Why not sparklyr?* > sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R > users. sparklyr includes MLlib API wrappers, but to the best of our knowledge > they are not comprehensive. 
Also, we propose a code-gen approach for this > package to minimize the work needed to expose future MLlib APIs, whereas sparklyr's API > is manually written. > h1. Target Personas > * Existing SparkR users who need a more flexible SparkML API > * R users (data scientists, statisticians) who wish to build Spark ML > pipelines in R > h1. Goals > * R users can install SparkML from CRAN > * R users will be able to import SparkML independently of SparkR > * After setting up a Spark session, R users can > * create a pipeline by chaining individual components and specifying their > parameters > * tune a pipeline in parallel, taking advantage of Spark > * inspect a pipeline's parameters and evaluation metrics > * repeatedly apply a pipeline > * MLlib contributors can easily add R wrappers for new MLlib Estimators and > Transformers > h1. Non-Goals > * Adding new algorithms to the SparkML R package which do not exist in Scala > * Parallelizing existing CRAN packages > * Changing the existing SparkR ML wrapping API > h1. Proposed API Changes > h2. Design goals > When encountering trade-offs in the API, we will choose based on the following > list of priorities. The API choice that addresses a higher-priority goal will > be chosen. > # *Comprehensive coverage of MLlib API:* Design choices that make R coverage > of future ML algorithms difficult will be ruled out. > * *Semantic clarity:* We attempt to minimize confusion with other packages. > Between conciseness and clarity, we will choose clarity. > * *Maintainability and testability:* API choices that require manual > maintenance or make testing difficult should be avoided. > * *Interoperability with the rest of Spark components:* We will keep the R API > as thin as possible and keep all functionality implemented in JVM/Scala. > * *Being natural to R users:* The ultimate users of this package are R users, and > they should find it easy and natural to use. 
> The API will follow familiar R function syntax, where the object is passed as > the first argument of the method: do_something(obj, arg1, arg2). All > constructors are dot separated (e.g., spark.logistic.regression()) and all > setters and getters are snake case (e.g., set_max_iter()). If a constructor > takes arguments, they will be named arguments. For example: > > > lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1) > > When calls need to be chained, as in the above example, the syntax translates > naturally to a pipeline style with help from the very popular [magrittr > package|https://cran.r-project.org/web/packages/magrittr/index.html]. For > example: > > > spark.logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr > h2. Namespace > All new APIs will be under a new CRAN package, named SparkML. The package > should be usable without needing SparkR in the namespace. The package will > introduce a number of S4
[jira] [Commented] (SPARK-23114) Spark R 2.3 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334885#comment-16334885 ] Hossein Falaki commented on SPARK-23114: [~felixcheung] I don't have any datasets to share. I have not seen a failure without a good reason. I have been filing individual tickets for each failure mode I have seen. I suggest we address them individually. > Spark R 2.3 QA umbrella > --- > > Key: SPARK-23114 > URL: https://issues.apache.org/jira/browse/SPARK-23114 > Project: Spark > Issue Type: Umbrella > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Felix Cheung >Priority: Critical > > This JIRA lists tasks for the next Spark release's QA period for SparkR. > The list below gives an overview of what is involved, and the corresponding > JIRA issues are linked below that. > h2. API > * Audit new public APIs (from the generated html doc) > ** relative to Spark Scala/Java APIs > ** relative to popular R libraries > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide sections & > examples > * Update Programming Guide > * Update website -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17762) invokeJava fails when serialized argument list is larger than INT_MAX (2,147,483,647) bytes
[ https://issues.apache.org/jira/browse/SPARK-17762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309053#comment-16309053 ] Hossein Falaki commented on SPARK-17762: I think SPARK-17790 is one place where this limit causes problems. Anywhere else we call {{writeBin}} we face a similar limitation. > invokeJava fails when serialized argument list is larger than INT_MAX > (2,147,483,647) bytes > --- > > Key: SPARK-17762 > URL: https://issues.apache.org/jira/browse/SPARK-17762 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > We call {{writeBin}} within {{writeRaw}}, which is called from invokeJava on > the serialized arguments list. Unfortunately, {{writeBin}} has a hard-coded > limit set to {{R_LEN_T_MAX}} (which is itself set to {{INT_MAX}} in base). > To work around it, we can check for this case and serialize the batch in > multiple parts. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
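The multi-part serialization workaround described in the ticket can be sketched in plain R. write_raw_chunked below is a hypothetical helper, not SparkR's actual code; it simply caps the size of any single {{writeBin}} call so the per-call length limit is never hit (a tiny chunk size is used here for illustration).

```r
# Sketch: write a large raw payload to a connection in fixed-size chunks,
# so no single writeBin() call exceeds the hard-coded length limit.
write_raw_chunked <- function(object, con, chunk_size = 1024L) {
  total  <- length(object)
  offset <- 0L
  while (offset < total) {
    n <- min(chunk_size, total - offset)
    # Each call writes at most chunk_size bytes.
    writeBin(object[(offset + 1L):(offset + n)], con)
    offset <- offset + n
  }
}

# Round-trip check against an in-memory raw connection.
con <- rawConnection(raw(0), "wb")
payload <- as.raw(sample(0:255, 5000, replace = TRUE))
write_raw_chunked(payload, con)
result <- rawConnectionValue(con)
close(con)
stopifnot(identical(result, payload))
```

The real fix would additionally need the Java side to accept a length-prefixed sequence of parts, but the R-side chunking is this simple.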
[jira] [Commented] (SPARK-22766) Install R linter package in spark lib directory
[ https://issues.apache.org/jira/browse/SPARK-22766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301751#comment-16301751 ] Hossein Falaki commented on SPARK-22766: [~felixcheung] This just makes installation of whatever version of {{lintr}} we pick more stable. The two tickets are orthogonal. > Install R linter package in spark lib directory > --- > > Key: SPARK-22766 > URL: https://issues.apache.org/jira/browse/SPARK-22766 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.1 >Reporter: Hossein Falaki > > {{dev/lint-r.R}} uses devtools to install the {{jimhester/lintr}} > package in the default site library location, which is > {{/usr/local/lib/R/site-library}}. This is not recommended and can fail > because we are running this script as jenkins while that directory is owned > by root. > We need to install the linter package in a local directory. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22812) Failing cran-check on master
[ https://issues.apache.org/jira/browse/SPARK-22812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293940#comment-16293940 ] Hossein Falaki commented on SPARK-22812: Do you know what is being checked in that step? Is it trying to reach a CRAN server? > Failing cran-check on master > - > > Key: SPARK-22812 > URL: https://issues.apache.org/jira/browse/SPARK-22812 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Hossein Falaki >Priority: Minor > > When I run {{R/run-tests.sh}} or {{R/check-cran.sh}} I get the following > failure message: > {code} > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 22] do not match the length of object [0] > {code} > cc [~felixcheung] have you experienced this error before? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22812) Failing cran-check on master
[ https://issues.apache.org/jira/browse/SPARK-22812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-22812: --- Priority: Minor (was: Major) > Failing cran-check on master > - > > Key: SPARK-22812 > URL: https://issues.apache.org/jira/browse/SPARK-22812 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Hossein Falaki >Priority: Minor > > When I run {{R/run-tests.sh}} or {{R/check-cran.sh}} I get the following > failure message: > {code} > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 22] do not match the length of object [0] > {code} > cc [~felixcheung] have you experienced this error before? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22812) Failing cran-check on master
Hossein Falaki created SPARK-22812: -- Summary: Failing cran-check on master Key: SPARK-22812 URL: https://issues.apache.org/jira/browse/SPARK-22812 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.3.0 Reporter: Hossein Falaki When I run {{R/run-tests.sh}} or {{R/check-cran.sh}} I get the following failure message: {code} * checking CRAN incoming feasibility ...Error in .check_package_CRAN_incoming(pkgdir) : dims [product 22] do not match the length of object [0] {code} cc [~felixcheung] have you experienced this error before? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22766) Install R linter package in spark lib directory
Hossein Falaki created SPARK-22766: -- Summary: Install R linter package in spark lib directory Key: SPARK-22766 URL: https://issues.apache.org/jira/browse/SPARK-22766 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.1 Reporter: Hossein Falaki {{dev/lint-r.R}} uses devtools to install the {{jimhester/lintr}} package in the default site library location, which is {{/usr/local/lib/R/site-library}}. This is not recommended and can fail because we are running this script as jenkins while that directory is owned by root. We need to install the linter package in a local directory. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
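A sketch of the proposed change: install the linter into a writable, Spark-local library and put that library on the search path, instead of the root-owned site library. The SPARK_HOME-relative path below is an assumption for illustration; the actual directory dev/lint-r.R would use is a project decision.

```r
# Sketch (assumed layout): use a library directory under SPARK_HOME that the
# jenkins user can write to, rather than /usr/local/lib/R/site-library.
local_lib <- file.path(Sys.getenv("SPARK_HOME", unset = "."), "R", "lib")
dir.create(local_lib, recursive = TRUE, showWarnings = FALSE)

# Put the local library first so both installation and library() resolve there.
.libPaths(c(local_lib, .libPaths()))

# Install only if missing; lib= directs the install into the local directory.
if (!requireNamespace("lintr", quietly = TRUE)) {
  install.packages("lintr", lib = local_lib,
                   repos = "https://cloud.r-project.org")
}
```

The same lib= argument works with devtools::install_github if pinning to jimhester/lintr is still desired.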
[jira] [Commented] (SPARK-22344) Prevent R CMD check from using /tmp
[ https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221460#comment-16221460 ] Hossein Falaki commented on SPARK-22344: I don't have a solid pointer as to why we are creating these temp directories for Hive. I think it would be nicer to fix them in Spark. It is good practice not to leave files in /tmp. > Prevent R CMD check from using /tmp > --- > > Key: SPARK-22344 > URL: https://issues.apache.org/jira/browse/SPARK-22344 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0 >Reporter: Shivaram Venkataraman > > When R CMD check is run on the SparkR package, it leaves behind files in /tmp, > which is a violation of CRAN policy. We should instead write to Rtmpdir. > Notes from CRAN are below > {code} > Checking this leaves behind dirs >hive/$USER >$USER > and files named like >b4f6459b-0624-4100-8358-7aa7afbda757_resources > in /tmp, in violation of the CRAN Policy. > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
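The Rtmpdir suggestion from the CRAN notes can be sketched as follows: create scratch directories under R's per-session {{tempdir()}} (which R CMD check sandboxes and cleans up) instead of hard-coding /tmp. The hive-scratch name is illustrative; the real fix would pass this path through to the JVM as the Hive scratch directory.

```r
# Sketch: session-scoped scratch area instead of /tmp. tempdir() points at
# the R session's Rtmp directory, which is removed when the session ends.
scratch <- file.path(tempdir(), "hive-scratch")
dir.create(scratch, recursive = TRUE, showWarnings = FALSE)

# e.g. hand `scratch` to sparkR.session() as a config value
# (such as hive.exec.scratchdir) rather than letting Hive default to /tmp.
stopifnot(dir.exists(scratch), startsWith(scratch, tempdir()))
```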
[jira] [Commented] (SPARK-15799) Release SparkR on CRAN
[ https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217279#comment-16217279 ] Hossein Falaki commented on SPARK-15799: Is there a ticket to follow up on the new policy-violation issue? The package has been archived by CRAN. > Release SparkR on CRAN > -- > > Key: SPARK-15799 > URL: https://issues.apache.org/jira/browse/SPARK-15799 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Xiangrui Meng >Assignee: Shivaram Venkataraman > Fix For: 2.1.2 > > > Story: "As an R user, I would like to see SparkR released on CRAN, so I can > use SparkR easily in an existing R environment and have other packages built > on top of SparkR." > I made this JIRA with the following questions in mind: > * Are there known issues that prevent us releasing SparkR on CRAN? > * Do we want to package Spark jars in the SparkR release? > * Are there license issues? > * How does it fit into Spark's release process? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17902) collect() ignores stringsAsFactors
[ https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209958#comment-16209958 ] Hossein Falaki commented on SPARK-17902: A simple unit test we could add would be: {code} > df <- createDataFrame(iris) > sapply(iris, typeof) == sapply(collect(df, stringsAsFactors = T), typeof) Sepal.Length Sepal.Width Petal.Length Petal.Width Species TRUE TRUE TRUE TRUE FALSE {code} As for the solution, I suggest performing the conversion inside [this loop|https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L1168]. > collect() ignores stringsAsFactors > -- > > Key: SPARK-17902 > URL: https://issues.apache.org/jira/browse/SPARK-17902 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > `collect()` function signature includes an optional flag named > `stringsAsFactors`. It seems it is completely ignored. > {code} > str(collect(createDataFrame(iris), stringsAsFactors = TRUE))) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
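The conversion suggested for that loop could look roughly like the helper below. apply_strings_as_factors is a hypothetical name, shown here on a local data.frame since the real fix would run inside collect() on the collected result; it honors the flag by converting only character columns.

```r
# Sketch of the proposed fix: after collecting into a local data.frame,
# honor stringsAsFactors by converting character columns to factors.
apply_strings_as_factors <- function(df, stringsAsFactors = FALSE) {
  if (stringsAsFactors) {
    for (name in names(df)) {
      if (is.character(df[[name]])) {
        df[[name]] <- as.factor(df[[name]])
      }
    }
  }
  df
}

# Non-character columns are untouched; character columns become factors.
local_df <- data.frame(x = c("a", "b"), y = 1:2, stringsAsFactors = FALSE)
out <- apply_strings_as_factors(local_df, stringsAsFactors = TRUE)
stopifnot(is.factor(out$x), is.integer(out$y))
```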
[jira] [Commented] (SPARK-15799) Release SparkR on CRAN
[ https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202385#comment-16202385 ] Hossein Falaki commented on SPARK-15799: Congrats everyone. Thanks for the hard work on this. > Release SparkR on CRAN > -- > > Key: SPARK-15799 > URL: https://issues.apache.org/jira/browse/SPARK-15799 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Xiangrui Meng >Assignee: Shivaram Venkataraman > Fix For: 2.1.2 > > > Story: "As an R user, I would like to see SparkR released on CRAN, so I can > use SparkR easily in an existing R environment and have other packages built > on top of SparkR." > I made this JIRA with the following questions in mind: > * Are there known issues that prevent us releasing SparkR on CRAN? > * Do we want to package Spark jars in the SparkR release? > * Are there license issues? > * How does it fit into Spark's release process? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15799) Release SparkR on CRAN
[ https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179754#comment-16179754 ] Hossein Falaki commented on SPARK-15799: It seems we can trivially use {{with}} instead of {{attach}}. [~shivaram] do you object? Otherwise I can file a ticket and submit a patch. > Release SparkR on CRAN > -- > > Key: SPARK-15799 > URL: https://issues.apache.org/jira/browse/SPARK-15799 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Xiangrui Meng > > Story: "As an R user, I would like to see SparkR released on CRAN, so I can > use SparkR easily in an existing R environment and have other packages built > on top of SparkR." > I made this JIRA with the following questions in mind: > * Are there known issues that prevent us releasing SparkR on CRAN? > * Do we want to package Spark jars in the SparkR release? > * Are there license issues? > * How does it fit into Spark's release process? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15799) Release SparkR on CRAN
[ https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179716#comment-16179716 ] Hossein Falaki commented on SPARK-15799: Hi guys, are there any updates on this? Is there anything I can help with? > Release SparkR on CRAN > -- > > Key: SPARK-15799 > URL: https://issues.apache.org/jira/browse/SPARK-15799 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Xiangrui Meng > > Story: "As an R user, I would like to see SparkR released on CRAN, so I can > use SparkR easily in an existing R environment and have other packages built > on top of SparkR." > I made this JIRA with the following questions in mind: > * Are there known issues that prevent us releasing SparkR on CRAN? > * Do we want to package Spark jars in the SparkR release? > * Are there license issues? > * How does it fit into Spark's release process? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21940) Support timezone for timestamps in SparkR
Hossein Falaki created SPARK-21940: -- Summary: Support timezone for timestamps in SparkR Key: SPARK-21940 URL: https://issues.apache.org/jira/browse/SPARK-21940 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.0 Reporter: Hossein Falaki {{SparkR::createDataFrame()}} wipes the timezone attribute from POSIXct and POSIXlt. See the following example: {code} > x <- data.frame(x = c(Sys.time())) > x x 1 2017-09-06 19:17:16 > attr(x$x, "tzone") <- "Europe/Paris" > x x 1 2017-09-07 04:17:16 > collect(createDataFrame(x)) x 1 2017-09-06 19:17:16 {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
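A pure-R illustration of what is being lost: the "tzone" attribute changes only how a POSIXct is displayed, not the underlying instant. That is exactly why the round trip above silently reverts the rendered time — the instant survives serialization, but the display attribute does not.

```r
# Same instant, two tzone attributes: the numeric value (seconds since epoch)
# is identical, only the formatted representation differs.
t0 <- as.POSIXct("2017-09-06 19:17:16", tz = "UTC")
t1 <- t0
attr(t1, "tzone") <- "Europe/Paris"

stopifnot(as.numeric(t0) == as.numeric(t1))  # identical instant
stopifnot(format(t0) != format(t1))          # different wall-clock rendering
```

So a fix only needs to carry the "tzone" attribute through SparkR's serialization and re-attach it on collect; the timestamp values themselves are already correct.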
[jira] [Commented] (SPARK-15799) Release SparkR on CRAN
[ https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16137639#comment-16137639 ] Hossein Falaki commented on SPARK-15799: Hi [~felixcheung] would you please share more details as to what the feedback was? Do we have followup tickets for the work (to address their comments)? > Release SparkR on CRAN > -- > > Key: SPARK-15799 > URL: https://issues.apache.org/jira/browse/SPARK-15799 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Xiangrui Meng > > Story: "As an R user, I would like to see SparkR released on CRAN, so I can > use SparkR easily in an existing R environment and have other packages built > on top of SparkR." > I made this JIRA with the following questions in mind: > * Are there known issues that prevent us releasing SparkR on CRAN? > * Do we want to package Spark jars in the SparkR release? > * Are there license issues? > * How does it fit into Spark's release process? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21450) List of NA is flattened inside a SparkR struct type
[ https://issues.apache.org/jira/browse/SPARK-21450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki resolved SPARK-21450. Resolution: Not A Bug > List of NA is flattened inside a SparkR struct type > --- > > Key: SPARK-21450 > URL: https://issues.apache.org/jira/browse/SPARK-21450 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > Consider the following two cases copied from {{test_sparkSQL.R}}: > {code} > df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) > schema <- structType(structField("date", "date")) > s1 <- collect(select(df, from_json(df$col, schema))) > s2 <- collect(select(df, from_json(df$col, schema, dateFormat = > "dd/MM/"))) > {code} > If you inspect s1 using {{str(s1)}} you will find: > {code} > 'data.frame': 2 obs. of 1 variable: > $ jsontostructs(col):List of 2 > ..$ : logi NA > {code} > But for s2, running {{str(s2)}} results in: > {code} > 'data.frame': 2 obs. of 1 variable: > $ jsontostructs(col):List of 2 > ..$ :List of 1 > .. ..$ date: Date, format: "2014-10-21" > .. ..- attr(*, "class")= chr "struct" > {code} > I assume this is not intentional and is just a subtle bug. Do you think > otherwise? [~shivaram] and [~felixcheung] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21450) List of NA is flattened inside a SparkR struct type
[ https://issues.apache.org/jira/browse/SPARK-21450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16091194#comment-16091194 ] Hossein Falaki commented on SPARK-21450: Thanks guys. I will close it as not an issue. > List of NA is flattened inside a SparkR struct type > --- > > Key: SPARK-21450 > URL: https://issues.apache.org/jira/browse/SPARK-21450 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > Consider the following two cases copied from {{test_sparkSQL.R}}: > {code} > df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) > schema <- structType(structField("date", "date")) > s1 <- collect(select(df, from_json(df$col, schema))) > s2 <- collect(select(df, from_json(df$col, schema, dateFormat = > "dd/MM/"))) > {code} > If you inspect s1 using {{str(s1)}} you will find: > {code} > 'data.frame': 2 obs. of 1 variable: > $ jsontostructs(col):List of 2 > ..$ : logi NA > {code} > But for s2, running {{str(s2)}} results in: > {code} > 'data.frame': 2 obs. of 1 variable: > $ jsontostructs(col):List of 2 > ..$ :List of 1 > .. ..$ date: Date, format: "2014-10-21" > .. ..- attr(*, "class")= chr "struct" > {code} > I assume this is not intentional and is just a subtle bug. Do you think > otherwise? [~shivaram] and [~felixcheung] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21450) List of NA is flattened inside a SparkR struct type
[ https://issues.apache.org/jira/browse/SPARK-21450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16091149#comment-16091149 ] Hossein Falaki commented on SPARK-21450: Yes, it was a copy/paste typo. Here is an example: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346827/1860822503083725/936056/latest.html The error is caused by our deserialization. I just want to confirm if this is indeed a bug. > List of NA is flattened inside a SparkR struct type > --- > > Key: SPARK-21450 > URL: https://issues.apache.org/jira/browse/SPARK-21450 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > Consider the following two cases copied from {{test_sparkSQL.R}}: > {code} > df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) > schema <- structType(structField("date", "date")) > s1 <- collect(select(df, from_json(df$col, schema))) > s2 <- collect(select(df, from_json(df$col, schema, dateFormat = > "dd/MM/"))) > {code} > If you inspect s1 using {{str(s1)}} you will find: > {code} > 'data.frame': 2 obs. of 1 variable: > $ jsontostructs(col):List of 2 > ..$ : logi NA > {code} > But for s2, running {{str(s2)}} results in: > {code} > 'data.frame': 2 obs. of 1 variable: > $ jsontostructs(col):List of 2 > ..$ :List of 1 > .. ..$ date: Date, format: "2014-10-21" > .. ..- attr(*, "class")= chr "struct" > {code} > I assume this is not intentional and is just a subtle bug. Do you think > otherwise? [~shivaram] and [~felixcheung] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21450) List of NA is flattened inside a SparkR struct type
[ https://issues.apache.org/jira/browse/SPARK-21450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-21450: --- Description: Consider the following two cases copied from {{test_sparkSQL.R}}: {code} df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) schema <- structType(structField("date", "date")) s1 <- collect(select(df, from_json(df$col, schema))) s2 <- collect(select(df, from_json(df$col, schema, dateFormat = "dd/MM/"))) {code} If you inspect s1 using {{str(s1)}} you will find: {code} 'data.frame': 2 obs. of 1 variable: $ jsontostructs(col):List of 2 ..$ : logi NA {code} But for s2, running {{str(s2)}} results in: {code} 'data.frame': 2 obs. of 1 variable: $ jsontostructs(col):List of 2 ..$ :List of 1 .. ..$ date: Date, format: "2014-10-21" .. ..- attr(*, "class")= chr "struct" {code} I assume this is not intentional and is just a subtle bug. Do you think otherwise? [~shivaram] and [~felixcheung] was: Consider the following two cases copied from {{test_sparkSQL.R}}: {code} df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) schema <- structType(structField("date", "date")) s1 <- collect(select(df, from_json(df$col, schema))) s2 <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/"))) {code} If you inspect s1 using {{str(s1)}} you will find: {code} 'data.frame': 2 obs. of 1 variable: $ jsontostructs(col):List of 2 ..$ : logi NA {code} But for s2, running {{str(s2)}} results in: {code} 'data.frame': 2 obs. of 1 variable: $ jsontostructs(col):List of 2 ..$ :List of 1 .. ..$ date: Date, format: "2014-10-21" .. ..- attr(*, "class")= chr "struct" {code} I assume this is not intentional and is just a subtle bug. Do you think otherwise? 
[~shivaram] and [~felixcheung] > List of NA is flattened inside a SparkR struct type > --- > > Key: SPARK-21450 > URL: https://issues.apache.org/jira/browse/SPARK-21450 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > Consider the following two cases copied from {{test_sparkSQL.R}}: > {code} > df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) > schema <- structType(structField("date", "date")) > s1 <- collect(select(df, from_json(df$col, schema))) > s2 <- collect(select(df, from_json(df$col, schema, dateFormat = > "dd/MM/"))) > {code} > If you inspect s1 using {{str(s1)}} you will find: > {code} > 'data.frame': 2 obs. of 1 variable: > $ jsontostructs(col):List of 2 > ..$ : logi NA > {code} > But for s2, running {{str(s2)}} results in: > {code} > 'data.frame': 2 obs. of 1 variable: > $ jsontostructs(col):List of 2 > ..$ :List of 1 > .. ..$ date: Date, format: "2014-10-21" > .. ..- attr(*, "class")= chr "struct" > {code} > I assume this is not intentional and is just a subtle bug. Do you think > otherwise? [~shivaram] and [~felixcheung] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21450) List of NA is flattened inside a SparkR struct type
[ https://issues.apache.org/jira/browse/SPARK-21450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-21450: --- Description: Consider the following two cases copied from {{test_sparkSQL.R}}: {code} df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) schema <- structType(structField("date", "date")) s1 <- collect(select(df, from_json(df$col, schema))) s2 <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/"))) {code} If you inspect s1 using {{str(s1)}} you will find: {code} 'data.frame': 2 obs. of 1 variable: $ jsontostructs(col):List of 2 ..$ : logi NA {code} But for s2, running {{str(s2)}} results in: {code} 'data.frame': 2 obs. of 1 variable: $ jsontostructs(col):List of 2 ..$ :List of 1 .. ..$ date: Date, format: "2014-10-21" .. ..- attr(*, "class")= chr "struct" {code} I assume this is not intentional and is just a subtle bug. Do you think otherwise? [~shivaram] and @felixcheung was: Consider the following two cases copied from {{test_sparkSQL.R}}: {code} df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) schema <- structType(structField("date", "date")) s1 <- collect(select(df, from_json(df$col, schema))) s2 <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/"))) {code} If you inspect s1 using {{str(s1)}} you will find: {code} 'data.frame': 2 obs. of 1 variable: $ jsontostructs(col):List of 2 ..$ : logi NA {code} But for s2, running {{str(s2)}} results in: {code} 'data.frame': 2 obs. of 1 variable: $ jsontostructs(col):List of 2 ..$ :List of 1 .. ..$ date: Date, format: "2014-10-21" .. ..- attr(*, "class")= chr "struct" {code} I assume this is not intentional and is just a subtle bug. Do you think otherwise? 
[~shivaram] and @ felixcheung > List of NA is flattened inside a SparkR struct type > --- > > Key: SPARK-21450 > URL: https://issues.apache.org/jira/browse/SPARK-21450 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > Consider the following two cases copied from {{test_sparkSQL.R}}: > {code} > df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) > schema <- structType(structField("date", "date")) > s1 <- collect(select(df, from_json(df$col, schema))) > s2 <- collect(select(df, from_json(df$col, schema2, dateFormat = > "dd/MM/"))) > {code} > If you inspect s1 using {{str(s1)}} you will find: > {code} > 'data.frame': 2 obs. of 1 variable: > $ jsontostructs(col):List of 2 > ..$ : logi NA > {code} > But for s2, running {{str(s2)}} results in: > {code} > 'data.frame': 2 obs. of 1 variable: > $ jsontostructs(col):List of 2 > ..$ :List of 1 > .. ..$ date: Date, format: "2014-10-21" > .. ..- attr(*, "class")= chr "struct" > {code} > I assume this is not intentional and is just a subtle bug. Do you think > otherwise? [~shivaram] and @felixcheung -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21450) List of NA is flattened inside a SparkR struct type
[ https://issues.apache.org/jira/browse/SPARK-21450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-21450: --- Description: Consider the following two cases copied from {{test_sparkSQL.R}}: {code} df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) schema <- structType(structField("date", "date")) s1 <- collect(select(df, from_json(df$col, schema))) s2 <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/"))) {code} If you inspect s1 using {{str(s1)}} you will find: {code} 'data.frame': 2 obs. of 1 variable: $ jsontostructs(col):List of 2 ..$ : logi NA {code} But for s2, running {{str(s2)}} results in: {code} 'data.frame': 2 obs. of 1 variable: $ jsontostructs(col):List of 2 ..$ :List of 1 .. ..$ date: Date, format: "2014-10-21" .. ..- attr(*, "class")= chr "struct" {code} I assume this is not intentional and is just a subtle bug. Do you think otherwise? [~shivaram] and [~felixcheung] was: Consider the following two cases copied from {{test_sparkSQL.R}}: {code} df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) schema <- structType(structField("date", "date")) s1 <- collect(select(df, from_json(df$col, schema))) s2 <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/"))) {code} If you inspect s1 using {{str(s1)}} you will find: {code} 'data.frame': 2 obs. of 1 variable: $ jsontostructs(col):List of 2 ..$ : logi NA {code} But for s2, running {{str(s2)}} results in: {code} 'data.frame': 2 obs. of 1 variable: $ jsontostructs(col):List of 2 ..$ :List of 1 .. ..$ date: Date, format: "2014-10-21" .. ..- attr(*, "class")= chr "struct" {code} I assume this is not intentional and is just a subtle bug. Do you think otherwise? 
[~shivaram] and @felixcheung > List of NA is flattened inside a SparkR struct type > --- > > Key: SPARK-21450 > URL: https://issues.apache.org/jira/browse/SPARK-21450 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > Consider the following two cases copied from {{test_sparkSQL.R}}: > {code} > df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) > schema <- structType(structField("date", "date")) > s1 <- collect(select(df, from_json(df$col, schema))) > s2 <- collect(select(df, from_json(df$col, schema2, dateFormat = > "dd/MM/"))) > {code} > If you inspect s1 using {{str(s1)}} you will find: > {code} > 'data.frame': 2 obs. of 1 variable: > $ jsontostructs(col):List of 2 > ..$ : logi NA > {code} > But for s2, running {{str(s2)}} results in: > {code} > 'data.frame': 2 obs. of 1 variable: > $ jsontostructs(col):List of 2 > ..$ :List of 1 > .. ..$ date: Date, format: "2014-10-21" > .. ..- attr(*, "class")= chr "struct" > {code} > I assume this is not intentional and is just a subtle bug. Do you think > otherwise? [~shivaram] and [~felixcheung] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21450) List of NA is flattened inside a SparkR struct type
Hossein Falaki created SPARK-21450: -- Summary: List of NA is flattened inside a SparkR struct type Key: SPARK-21450 URL: https://issues.apache.org/jira/browse/SPARK-21450 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.0 Reporter: Hossein Falaki Consider the following two cases copied from {{test_sparkSQL.R}}: {code} df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) schema <- structType(structField("date", "date")) s1 <- collect(select(df, from_json(df$col, schema))) s2 <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/"))) {code} If you inspect s1 using {{str(s1)}} you will find: {code} 'data.frame': 2 obs. of 1 variable: $ jsontostructs(col):List of 2 ..$ : logi NA {code} But for s2, running {{str(s2)}} results in: {code} 'data.frame': 2 obs. of 1 variable: $ jsontostructs(col):List of 2 ..$ :List of 1 .. ..$ date: Date, format: "2014-10-21" .. ..- attr(*, "class")= chr "struct" {code} I assume this is not intentional and is just a subtle bug. Do you think otherwise? [~shivaram] and @ felixcheung -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21263) NumberFormatException is not thrown while converting an invalid string to float/double
[ https://issues.apache.org/jira/browse/SPARK-21263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074109#comment-16074109 ] Hossein Falaki commented on SPARK-21263: [~sowen] note that the user specified the mode to be PERMISSIVE. In this mode the CSV data source will try to ignore errors and return some result. If the mode is FAILFAST, it should throw an exception. I see the permissiveness of the different modes as follows: {code} PERMISSIVE > DROPMALFORMED > FAILFAST {code} Here we have different behavior for {{IntegerType}} vs. {{DoubleType}}. That needs to be fixed and the behavior should be consistent. > NumberFormatException is not thrown while converting an invalid string to > float/double > -- > > Key: SPARK-21263 > URL: https://issues.apache.org/jira/browse/SPARK-21263 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.1 >Reporter: Navya Krishnappa > > When reading the below-mentioned data with a user-defined schema, an > exception is not thrown. 
Refer to the details: > *Data:* > 'PatientID','PatientName','TotalBill' > '1000','Patient1','10u000' > '1001','Patient2','3' > '1002','Patient3','4' > '1003','Patient4','5' > '1004','Patient5','6' > *Source code*: > Dataset dataset = sparkSession.read().schema(schema) > .option(INFER_SCHEMA, "true") > .option(DELIMITER, ",") > .option(QUOTE, "\"") > .option(MODE, Mode.PERMISSIVE) > .csv(sourceFile); > When we collect the dataset data: > dataset.collectAsList(); > *Schema1*: > [StructField(PatientID,IntegerType,true), > StructField(PatientName,StringType,true), > StructField(TotalBill,IntegerType,true)] > *Result *: Throws NumberFormatException > Caused by: java.lang.NumberFormatException: For input string: "10u000" > *Schema2*: > [StructField(PatientID,IntegerType,true), > StructField(PatientName,StringType,true), > StructField(TotalBill,DoubleType,true)] > *Actual Result*: > "PatientID": 1000, > "NumberOfVisits": "400", > "TotalBill": 10, > *Expected Result*: Should throw NumberFormatException for input string > "10u000" -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
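The mode ordering in the comment above can be made concrete with a small sketch. This is illustrative Python, not Spark's actual CSV parser; `parse_row` and its arguments are hypothetical names:

```python
def parse_row(fields, casts, mode):
    """Parse one CSV row, one cast per column, under a given parse mode.

    PERMISSIVE keeps the row and nulls out the bad field,
    DROPMALFORMED drops the whole row, FAILFAST raises.
    """
    out = []
    for value, cast in zip(fields, casts):
        try:
            out.append(cast(value))
        except ValueError:
            if mode == "FAILFAST":
                raise
            if mode == "DROPMALFORMED":
                return None  # caller discards the row
            out.append(None)  # PERMISSIVE: null just this field
    return out

row = ["1000", "Patient1", "10u000"]  # "10u000" is not a valid number
assert parse_row(row, [int, str, float], "PERMISSIVE") == [1000, "Patient1", None]
assert parse_row(row, [int, str, float], "DROPMALFORMED") is None
```

Under this model, malformed fields behave the same way regardless of whether the target column type is an integer or a double, which is the consistency the comment asks for.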
[jira] [Updated] (SPARK-20684) expose createGlobalTempView and dropGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-20684: --- Summary: expose createGlobalTempView and dropGlobalTempView in SparkR (was: expose createGlobalTempView in SparkR) > expose createGlobalTempView and dropGlobalTempView in SparkR > > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages within a single Spark application. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20684) expose createGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005552#comment-16005552 ] Hossein Falaki commented on SPARK-20684: Yes, I agree. > expose createGlobalTempView in SparkR > - > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages within a single Spark application. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20684) expose createGlobalTempView in SparkR
Hossein Falaki created SPARK-20684: -- Summary: expose createGlobalTempView in SparkR Key: SPARK-20684 URL: https://issues.apache.org/jira/browse/SPARK-20684 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.0 Reporter: Hossein Falaki This is a useful API that is not exposed in SparkR. It will help with moving data between languages within a single Spark application. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20661) SparkR tableNames() test fails
Hossein Falaki created SPARK-20661: -- Summary: SparkR tableNames() test fails Key: SPARK-20661 URL: https://issues.apache.org/jira/browse/SPARK-20661 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.0 Reporter: Hossein Falaki Due to prior state created by other test cases, testing {{tableNames()}} is failing in master. https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20088) Do not create new SparkContext in SparkR createSparkContext
Hossein Falaki created SPARK-20088: -- Summary: Do not create new SparkContext in SparkR createSparkContext Key: SPARK-20088 URL: https://issues.apache.org/jira/browse/SPARK-20088 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.0 Reporter: Hossein Falaki In the implementation of {{createSparkContext}}, we are calling {code} new JavaSparkContext() {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20007) Make SparkR apply() functions robust to workers that return empty data.frame
Hossein Falaki created SPARK-20007: -- Summary: Make SparkR apply() functions robust to workers that return empty data.frame Key: SPARK-20007 URL: https://issues.apache.org/jira/browse/SPARK-20007 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.0 Reporter: Hossein Falaki When using {{gapply()}} (or other members of the {{apply()}} family) with a schema, Spark will try to parse data returned from the R process on each worker as Spark DataFrame Rows based on the schema. In this case our provided schema suggests that we have six columns. When an R worker returns results to the JVM, SparkSQL will try to access its columns one by one and cast them to the proper types. If the R worker returns nothing, the JVM will throw an {{ArrayIndexOutOfBoundsException}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
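The failure mode described above can be sketched in Python. This is a hypothetical model of the column-indexing step, not the actual JVM code path:

```python
def rows_from_worker(columns, n_cols):
    """Convert a column-oriented worker result into rows.

    Guards against a worker that returned nothing: the unguarded path
    would index columns[0] and raise IndexError, the Python analogue of
    the JVM's ArrayIndexOutOfBoundsException.
    """
    if not columns:
        return []  # empty result: produce zero rows instead of failing
    assert len(columns) == n_cols, "schema/result column-count mismatch"
    n_rows = len(columns[0])
    return [tuple(col[i] for col in columns) for i in range(n_rows)]

assert rows_from_worker([], 6) == []
assert rows_from_worker([[1, 2], ["a", "b"]], 2) == [(1, "a"), (2, "b")]
```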
[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762687#comment-15762687 ] Hossein Falaki commented on SPARK-18924: It would be good to think about this along with the efforts to have zero-copy data sharing between JVM and R. I think if we do that, a lot of the Ser/De problems in the data plane will go away. > Improve collect/createDataFrame performance in SparkR > - > > Key: SPARK-18924 > URL: https://issues.apache.org/jira/browse/SPARK-18924 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Xiangrui Meng >Priority: Critical > > SparkR has its own SerDe for data serialization between JVM and R. > The SerDe on the JVM side is implemented in: > * > [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala] > * > [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala] > The SerDe on the R side is implemented in: > * > [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R] > * > [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R] > The serialization between JVM and R suffers from huge storage and computation > overhead. For example, a short round trip of 1 million doubles surprisingly > took 3 minutes on my laptop: > {code} > > system.time(collect(createDataFrame(data.frame(x=runif(100) >user system elapsed > 14.224 0.582 189.135 > {code} > Collecting a medium-sized DataFrame to local and continuing with a local R > workflow is a use case we should pay attention to. SparkR will never be able > to cover all existing features from CRAN packages. It is also unnecessary for > Spark to do so because not all features need scalability. > Several factors contribute to the serialization overhead: > 1. The SerDe on the R side is implemented using high-level R methods. > 2. 
DataFrame columns are not efficiently serialized, primitive type columns > in particular. > 3. Some overhead in the serialization protocol/impl. > 1) might have been discussed before, because R packages like rJava existed before > SparkR. I'm not sure whether we have a license issue in depending on those > libraries. Another option is to switch to the low-level R C interface or Rcpp, > which again might have a license issue. I'm not an expert here. If we have to > implement our own, there is still much room for improvement, discussed > below. > 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, > which collects rows locally and then constructs columns. However, > * it ignores column types and results in boxing/unboxing overhead > * it collects all objects to the driver and results in high GC pressure > A relatively simple change is to implement a specialized column builder based > on column types, primitive types in particular. We need to handle null/NA > values properly. A simple data structure we can use is > {code} > val size: Int > val nullIndexes: Array[Int] > val notNullValues: Array[T] // specialized for primitive types > {code} > On the R side, we can use `readBin` and `writeBin` to read the entire vector > in a single method call. The speed seems reasonable (on the order of GB/s): > {code} > > x <- runif(1000) # 1e7, not 1e6 > > system.time(r <- writeBin(x, raw(0))) >user system elapsed > 0.036 0.021 0.059 > > > system.time(y <- readBin(r, double(), 1000)) >user system elapsed > 0.015 0.007 0.024 > {code} > This is just a proposal that needs to be discussed and formalized. But in > general, it should be feasible to obtain a 20x or greater performance gain. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
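The proposed {{(size, nullIndexes, notNullValues)}} layout can be sketched as follows. This is a Python illustration of the idea with hypothetical names, not the Scala implementation the proposal describes:

```python
import array

def encode_column(values):
    """Encode a double column as (size, null indexes, packed non-nulls)."""
    null_idx = [i for i, v in enumerate(values) if v is None]
    not_null = array.array("d", [v for v in values if v is not None])
    return len(values), null_idx, not_null

def decode_column(size, null_idx, not_null):
    """Rebuild the column, reinserting None at the recorded indexes."""
    nulls = set(null_idx)
    it = iter(not_null)
    return [None if i in nulls else next(it) for i in range(size)]

col = [1.5, None, 3.0, None, 5.25]
assert decode_column(*encode_column(col)) == col
# Non-null values travel as one contiguous 8-byte-per-element buffer,
# analogous to R's writeBin/readBin on a double vector.
assert encode_column(col)[2].tobytes() == array.array("d", [1.5, 3.0, 5.25]).tobytes()
```

The design point is that null bookkeeping is separated from the payload, so the payload can be moved as one binary block instead of element by element.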
[jira] [Updated] (SPARK-18011) SparkR serialize "NA" throws exception
[ https://issues.apache.org/jira/browse/SPARK-18011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-18011: --- Description: For some versions of R, if Date has "NA" field, backend will throw negative index exception. To reproduce the problem: {code} > a <- as.Date(c("2016-11-11", "NA")) > b <- as.data.frame(a) > c <- createDataFrame(b) > dim(c) 16/10/19 10:31:24 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) java.lang.NegativeArraySizeException at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110) at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119) at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128) at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:77) at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61) at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161) at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.Range.foreach(Range.scala:160) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.api.r.SQLUtils$.bytesToRow(SQLUtils.scala:160) at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138) at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown 
Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} was: For some versions of R, if Date has "NA" field, backend will throw negative index exception. To reproduce the problem: > a <- as.Date(c("2016-11-11", "NA")) > b <- as.data.frame(a) > c <- createDataFrame(b) > dim(c) 16/10/19 10:31:24 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) java.lang.NegativeArraySizeException at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110) at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119) at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128) at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:77) at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61) at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161) at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.Range.foreach(Range.scala:160) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.api.r.SQLUtils$.bytesToRow(SQLUtils.scala:160) at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138) at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at
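The stack trace above fails inside {{readDate}} when the wire payload for a Date is the literal string "NA". A hedged sketch of the defensive check a deserializer could apply (hypothetical Python, not SparkR's actual SerDe):

```python
from datetime import date

def read_date(payload):
    """Treat "NA" (or an empty payload) as a missing date instead of
    attempting to parse it, which is where the original code fails."""
    if payload in ("NA", ""):
        return None
    year, month, day = map(int, payload.split("-"))
    return date(year, month, day)

assert read_date("2016-11-11") == date(2016, 11, 11)
assert read_date("NA") is None
```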
[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data
[ https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578607#comment-15578607 ] Hossein Falaki commented on SPARK-17878: I think moving it to another ticket is a good idea. One concern is that the API would be different between Scala and DDL. > Support for multiple null values when reading CSV data > -- > > Key: SPARK-17878 > URL: https://issues.apache.org/jira/browse/SPARK-17878 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > There are CSV files out there with multiple values that are supposed to be > interpreted as null. As a result, multiple Spark users have asked for this > feature to be built into the CSV data source. It can be easily implemented in a > backwards-compatible way: > - Currently the CSV data source supports an option named {{nullValue}}. > - We can add logic in {{CSVOptions}} to understand option names that match > {{nullValue[\d]}}. This way the user can specify a query with one or multiple > null values. > {code} > val df = spark.read.format("CSV").option("nullValue1", > "-").option("nullValue2", "*") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is
[ https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573692#comment-15573692 ] Hossein Falaki commented on SPARK-17916: Thanks for linking it. Yes, they are very much the same issue. However, I slightly disagree with the proposed solution. I will comment on the PR. > CSV data source treats empty string as null no matter what nullValue option is > -- > > Key: SPARK-17916 > URL: https://issues.apache.org/jira/browse/SPARK-17916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > When the user configures {{nullValue}} in the CSV data source, in addition to those > values, all empty string values are also converted to null. > {code} > data: > col1,col2 > 1,"-" > 2,"" > {code} > {code} > spark.read.format("csv").option("nullValue", "-") > {code} > We will find a null in both rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17919) Make timeout to RBackend configurable in SparkR
Hossein Falaki created SPARK-17919: -- Summary: Make timeout to RBackend configurable in SparkR Key: SPARK-17919 URL: https://issues.apache.org/jira/browse/SPARK-17919 Project: Spark Issue Type: Story Components: SparkR Affects Versions: 2.0.1 Reporter: Hossein Falaki I am working on a project where {{gapply()}} is being used with a large dataset that happens to be extremely skewed. On the skewed partition, the user function takes more than 2 hours to return, and that turns out to be longer than the timeout that we hardcode in SparkR for the backend connection. {code} connectBackend <- function(hostname, port, timeout = 6000) {code} Ideally the user should be able to reconfigure Spark and increase the timeout. It should be a small fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is
Hossein Falaki created SPARK-17916: -- Summary: CSV data source treats empty string as null no matter what nullValue option is Key: SPARK-17916 URL: https://issues.apache.org/jira/browse/SPARK-17916 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Reporter: Hossein Falaki When the user configures {{nullValue}} in the CSV data source, in addition to those values, all empty string values are also converted to null. {code} data: col1,col2 1,"-" 2,"" {code} {code} spark.read.format("csv").option("nullValue", "-") {code} We will find a null in both rows. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
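The expected behavior in the report can be sketched as a single conversion rule; illustrative Python only, not Spark's CSV reader:

```python
def convert_field(field, null_value):
    """Only the configured nullValue should map to None; an empty string
    is a legitimate value and should survive as-is."""
    return None if field == null_value else field

# data from the report: 1,"-" and 2,""
assert [convert_field(f, "-") for f in ["1", "-"]] == ["1", None]
assert [convert_field(f, "-") for f in ["2", ""]] == ["2", ""]  # "" is not null
```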
[jira] [Commented] (SPARK-17902) collect() ignores stringsAsFactors
[ https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572689#comment-15572689 ] Hossein Falaki commented on SPARK-17902: Thanks for the pointer [~shivaram]. I will submit a patch with a regression test. The only obvious side effect of this bug is that the collected type will be String, while it should have been a Factor. What makes it bad is that it is in our documentation and it used to work, so it is a regression. > collect() ignores stringsAsFactors > -- > > Key: SPARK-17902 > URL: https://issues.apache.org/jira/browse/SPARK-17902 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > `collect()` function signature includes an optional flag named > `stringsAsFactors`. It seems it is completely ignored. > {code} > str(collect(createDataFrame(iris), stringsAsFactors = TRUE)) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17902) collect() ignores stringsAsFactors
Hossein Falaki created SPARK-17902: -- Summary: collect() ignores stringsAsFactors Key: SPARK-17902 URL: https://issues.apache.org/jira/browse/SPARK-17902 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.0.1 Reporter: Hossein Falaki `collect()` function signature includes an optional flag named `stringsAsFactors`. It seems it is completely ignored. {code} str(collect(createDataFrame(iris), stringsAsFactors = TRUE)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
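What honoring the flag would look like can be sketched in Python, modeling an R factor as a pair of (levels, integer codes); the function name and shape here are hypothetical, not SparkR's implementation:

```python
def collect_local(columns, strings_as_factors=False):
    """Post-process collected columns, converting string columns to a
    factor-like (levels, codes) pair when stringsAsFactors is TRUE."""
    out = {}
    for name, col in columns.items():
        if strings_as_factors and col and all(isinstance(v, str) for v in col):
            levels = sorted(set(col))
            out[name] = (levels, [levels.index(v) for v in col])
        else:
            out[name] = col  # flag off, or non-string column: unchanged
    return out

res = collect_local({"species": ["setosa", "virginica", "setosa"]},
                    strings_as_factors=True)
assert res["species"] == (["setosa", "virginica"], [0, 1, 0])
```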
[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data
[ https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567180#comment-15567180 ] Hossein Falaki commented on SPARK-17878: Sure. If passing a list is possible, it is the better choice. I just don't want to block this feature on an API change in SparkSQL. > Support for multiple null values when reading CSV data > -- > > Key: SPARK-17878 > URL: https://issues.apache.org/jira/browse/SPARK-17878 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > There are CSV files out there with multiple values that are supposed to be > interpreted as null. As a result, multiple Spark users have asked for this > feature to be built into the CSV data source. It can be easily implemented in a > backwards-compatible way: > - Currently the CSV data source supports an option named {{nullValue}}. > - We can add logic in {{CSVOptions}} to understand option names that match > {{nullValue[\d]}}. This way the user can specify a query with one or multiple > null values. > {code} > val df = spark.read.format("CSV").option("nullValue1", > "-").option("nullValue2", "*") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data
[ https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567104#comment-15567104 ] Hossein Falaki commented on SPARK-17878: That would require an API change in SparkSQL. Otherwise, we would need to split the string passed to {{nullValue}} on a delimiter. Neither of these is a safe choice. I think accepting {{nullValue1}}, {{nullValue2}}, etc. (along with the existing {{nullValue}}) is: * backwards compatible * clear * extensible to other options in the future, e.g., quoteCharacter > Support for multiple null values when reading CSV data > -- > > Key: SPARK-17878 > URL: https://issues.apache.org/jira/browse/SPARK-17878 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > There are CSV files out there with multiple values that are supposed to be > interpreted as null. As a result, multiple Spark users have asked for this > feature to be built into the CSV data source. It can be easily implemented in a > backwards-compatible way: > - Currently the CSV data source supports an option named {{nullValue}}. > - We can add logic in {{CSVOptions}} to understand option names that match > {{nullValue[\d]}}. This way the user can specify a query with one or multiple > null values. > {code} > val df = spark.read.format("CSV").option("nullValue1", > "-").option("nullValue2", "*") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17878) Support for multiple null values when reading CSV data
Hossein Falaki created SPARK-17878: -- Summary: Support for multiple null values when reading CSV data Key: SPARK-17878 URL: https://issues.apache.org/jira/browse/SPARK-17878 Project: Spark Issue Type: Story Components: SQL Affects Versions: 2.0.1 Reporter: Hossein Falaki There are CSV files out there with multiple values that are supposed to be interpreted as null. As a result, multiple Spark users have asked for this feature built into the CSV data source. It can be easily implemented in a backwards compatible way: - Currently the CSV data source supports an option named {{nullValue}}. - We can add logic in {{CSVOptions}} to understand option names that match {{nullValue[\d]}}. This way the user can specify one or more null values in a query. {code} val df = spark.read.format("CSV").option("nullValue1", "-").option("nullValue2", "*") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()
[ https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566541#comment-15566541 ] Hossein Falaki commented on SPARK-17781: [~shivaram] Thanks for looking into it. I think the problem applies to {{dapply}} as well. For example this fails: {code} > df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) > collect(dapply(df, function(x) {data.frame(res = x$date)}, schema = structType(structField("res", "date")))) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 52.0 failed 4 times, most recent failure: Lost task 0.3 in stage 52.0 (TID 10114, 10.0.229.211): java.lang.RuntimeException: java.lang.Double is not a valid external type for schema of date {code} I spent a few hours getting to the root of it. We have the correct type all the way until {{readList}} in {{deserialize.R}}. I instrumented that function. We get the correct type from {{readObject()}}, but once it is placed in the list it loses its type. > datetime is serialized as double inside dapply() > > > Key: SPARK-17781 > URL: https://issues.apache.org/jira/browse/SPARK-17781 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > When we ship a SparkDataFrame to workers for dapply family functions, inside > the worker DateTime objects are serialized as double. > To reproduce: > {code} > df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) > dapplyCollect(df, function(x) { return(x$date) }) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()
[ https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563944#comment-15563944 ] Hossein Falaki commented on SPARK-17781: Yes, but somehow inside {{worker.R}} Date fields in the list are treated as Double. Case in point is the reproducing example in the body of the ticket. To be honest, I am still confused about this too. > datetime is serialized as double inside dapply() > > > Key: SPARK-17781 > URL: https://issues.apache.org/jira/browse/SPARK-17781 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > When we ship a SparkDataFrame to workers for dapply family functions, inside > the worker DateTime objects are serialized as double. > To reproduce: > {code} > df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) > dapplyCollect(df, function(x) { return(x$date) }) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17811) SparkR cannot parallelize data.frame with NA or NULL in Date columns
[ https://issues.apache.org/jira/browse/SPARK-17811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563506#comment-15563506 ] Hossein Falaki commented on SPARK-17811: Thanks [~wm624] for looking into it. I submitted a small PR to fix the issue. > SparkR cannot parallelize data.frame with NA or NULL in Date columns > > > Key: SPARK-17811 > URL: https://issues.apache.org/jira/browse/SPARK-17811 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > To reproduce: > {code} > df <- data.frame(Date = as.Date(c(rep("2016-01-10", 10), "NA", "NA")), id = > 1:12) > dim(createDataFrame(df)) > {code} > We don't seem to have this problem with POSIXlt > {code} > df <- data.frame(Date = as.POSIXlt(as.Date(c(rep("2016-01-10", 10), "NA", > "NA"))), id = 1:12) > dim(createDataFrame(df)) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()
[ https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563193#comment-15563193 ] Hossein Falaki commented on SPARK-17781: I investigated the issue. The root cause is that Date (and Timestamp) types are converted to their underlying representations when they are put in a list. To see it, run the following simple test in an R REPL: {code} > l <- lapply(1:2, function(x) { Sys.Date() }) > print(paste("list values", l)) [1] "list values 17084" "list values 17084" {code} A similar problem happens with the POSIXlt and POSIXct types. Therefore in {{worker.R}} when we call {{computeFunc(inputData)}} we are dealing with a list that contains double values for date fields. Right now it seems the safe way to work around it is to avoid Date and Time types and use String instead. [~shivaram] and [~felixcheung] do you have any ideas? > datetime is serialized as double inside dapply() > > > Key: SPARK-17781 > URL: https://issues.apache.org/jira/browse/SPARK-17781 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > When we ship a SparkDataFrame to workers for dapply family functions, inside > the worker DateTime objects are serialized as double. > To reproduce: > {code} > df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) > dapplyCollect(df, function(x) { return(x$date) }) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17811) SparkR cannot parallelize data.frame with NA or NULL in Date columns
Hossein Falaki created SPARK-17811: -- Summary: SparkR cannot parallelize data.frame with NA or NULL in Date columns Key: SPARK-17811 URL: https://issues.apache.org/jira/browse/SPARK-17811 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.0.1 Reporter: Hossein Falaki To reproduce: {code} df <- data.frame(Date = as.Date(c(rep("2016-01-10", 10), "NA", "NA")), id = 1:12) dim(createDataFrame(df)) {code} We don't seem to have this problem with POSIXlt {code} df <- data.frame(Date = as.POSIXlt(as.Date(c(rep("2016-01-10", 10), "NA", "NA"))), id = 1:12) dim(createDataFrame(df)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17774) Add support for head on DataFrame Column
[ https://issues.apache.org/jira/browse/SPARK-17774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15550488#comment-15550488 ] Hossein Falaki commented on SPARK-17774: I agree. I think if we decouple {{head}} from {{collect}} there is better chance that your PR gets through. Would you please rebase it and for now just include {{head}}? > Add support for head on DataFrame Column > > > Key: SPARK-17774 > URL: https://issues.apache.org/jira/browse/SPARK-17774 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > There was a lot of discussion on SPARK-9325. To summarize the conversation on > that ticket regardign {{collect}} > * Pro: Ease of use and maximum compatibility with existing R API > * Con: We do not want to increase maintenance cost by opening arbitrary API. > With Spark's DataFrame API {{collect}} does not work on {{Column}} and there > is no need for it to work in R. > This ticket is strictly about {{head}}. I propose supporting {{head}} on > {{Column}} because: > 1. R users are already used to calling {{head(iris$Sepal.Length)}}. When they > do that on SparkDataFrame they get an error. Not a good experience > 2. Adding support for it does not require any change to the backend. It can > be trivially done in R code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17790) Support for parallelizing data.frame larger than 2GB
[ https://issues.apache.org/jira/browse/SPARK-17790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-17790: --- Issue Type: Sub-task (was: Story) Parent: SPARK-6235 > Support for parallelizing data.frame larger than 2GB > > > Key: SPARK-17790 > URL: https://issues.apache.org/jira/browse/SPARK-17790 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > This issue is a more specific version of SPARK-17762. > Supporting larger than 2GB arguments is more general and arguably harder to > do because the limit exists both in R and JVM (because we receive data as a > ByteArray). However, to support parallalizing R data.frames that are larger > than 2GB we can do what PySpark does. > PySpark uses files to transfer bulk data between Python and JVM. It has > worked well for the large community of Spark Python users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17790) Support for parallelizing R data.frame larger than 2GB
[ https://issues.apache.org/jira/browse/SPARK-17790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-17790: --- Summary: Support for parallelizing R data.frame larger than 2GB (was: Support for parallelizing data.frame larger than 2GB) > Support for parallelizing R data.frame larger than 2GB > -- > > Key: SPARK-17790 > URL: https://issues.apache.org/jira/browse/SPARK-17790 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > This issue is a more specific version of SPARK-17762. > Supporting larger than 2GB arguments is more general and arguably harder to > do because the limit exists both in R and JVM (because we receive data as a > ByteArray). However, to support parallalizing R data.frames that are larger > than 2GB we can do what PySpark does. > PySpark uses files to transfer bulk data between Python and JVM. It has > worked well for the large community of Spark Python users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17790) Support for parallelizing data.frame larger than 2GB
[ https://issues.apache.org/jira/browse/SPARK-17790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549996#comment-15549996 ] Hossein Falaki commented on SPARK-17790: Thanks for pointing it out. SPARK-6235 seems to be an umbrella ticket. This one can be a subtask of it. > Support for parallelizing data.frame larger than 2GB > > > Key: SPARK-17790 > URL: https://issues.apache.org/jira/browse/SPARK-17790 > Project: Spark > Issue Type: Story > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > This issue is a more specific version of SPARK-17762. > Supporting larger than 2GB arguments is more general and arguably harder to > do because the limit exists both in R and JVM (because we receive data as a > ByteArray). However, to support parallalizing R data.frames that are larger > than 2GB we can do what PySpark does. > PySpark uses files to transfer bulk data between Python and JVM. It has > worked well for the large community of Spark Python users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17790) Support for parallelizing data.frame larger than 2GB
[ https://issues.apache.org/jira/browse/SPARK-17790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-17790: --- Summary: Support for parallelizing data.frame larger than 2GB (was: Support for parallelizing/creating DataFrame on data larger than 2GB) > Support for parallelizing data.frame larger than 2GB > > > Key: SPARK-17790 > URL: https://issues.apache.org/jira/browse/SPARK-17790 > Project: Spark > Issue Type: Story > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > This issue is a more specific version of SPARK-17762. > Supporting larger than 2GB arguments is more general and arguably harder to > do because the limit exists both in R and JVM (because we receive data as a > ByteArray). However, to support parallalizing R data.frames that are larger > than 2GB we can do what PySpark does. > PySpark uses files to transfer bulk data between Python and JVM. It has > worked well for the large community of Spark Python users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17790) Support for parallelizing/creating DataFrame on data larger than 2GB
[ https://issues.apache.org/jira/browse/SPARK-17790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549932#comment-15549932 ] Hossein Falaki commented on SPARK-17790: [~shivaram] and [~mengxr] just double checking that in all supported SparkR deployment modes, Driver R and JVM are on the same machine? > Support for parallelizing/creating DataFrame on data larger than 2GB > > > Key: SPARK-17790 > URL: https://issues.apache.org/jira/browse/SPARK-17790 > Project: Spark > Issue Type: Story > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > This issue is a more specific version of SPARK-17762. > Supporting larger than 2GB arguments is more general and arguably harder to > do because the limit exists both in R and JVM (because we receive data as a > ByteArray). However, to support parallalizing R data.frames that are larger > than 2GB we can do what PySpark does. > PySpark uses files to transfer bulk data between Python and JVM. It has > worked well for the large community of Spark Python users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17790) Support for parallelizing/creating DataFrame on data larger than 2GB
Hossein Falaki created SPARK-17790: -- Summary: Support for parallelizing/creating DataFrame on data larger than 2GB Key: SPARK-17790 URL: https://issues.apache.org/jira/browse/SPARK-17790 Project: Spark Issue Type: Story Components: SparkR Affects Versions: 2.0.1 Reporter: Hossein Falaki This issue is a more specific version of SPARK-17762. Supporting larger than 2GB arguments is more general and arguably harder to do because the limit exists both in R and the JVM (because we receive data as a ByteArray). However, to support parallelizing R data.frames that are larger than 2GB we can do what PySpark does. PySpark uses files to transfer bulk data between Python and the JVM. It has worked well for the large community of Spark Python users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
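The file-based handoff the ticket refers to can be sketched in miniature. The following is a hedged Python illustration (not PySpark's actual code; all names are hypothetical) of the pattern: the producer serializes a batch to a temporary file and passes only the short file path over the control channel, so the channel itself never carries a >2GB byte array:

```python
import os
import pickle
import tempfile

# Hypothetical sketch of a file-mediated handoff between two runtimes.
def producer_write(batch):
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(batch, f)  # bulk data goes to disk, not the socket
    return path  # only the path crosses the size-limited channel

def consumer_read(path):
    with open(path, "rb") as f:
        batch = pickle.load(f)
    os.remove(path)  # clean up the transfer file after reading
    return batch

path = producer_write([{"id": i} for i in range(3)])
print(consumer_read(path))
# [{'id': 0}, {'id': 1}, {'id': 2}]
```

The trade-off is an extra disk round trip per batch in exchange for removing the in-memory size ceiling on the channel.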
[jira] [Commented] (SPARK-17774) Add support for head on DataFrame Column
[ https://issues.apache.org/jira/browse/SPARK-17774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549913#comment-15549913 ] Hossein Falaki commented on SPARK-17774: I strongly feel {{head}} should work, but I don't have as strong an opinion about {{collect}}, although it would be nice if that worked too. On returning an error vs. an empty vector, I think returning an error is safer because it will not cause silent problems. > Add support for head on DataFrame Column > > > Key: SPARK-17774 > URL: https://issues.apache.org/jira/browse/SPARK-17774 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > There was a lot of discussion on SPARK-9325. To summarize the conversation on > that ticket regarding {{collect}} > * Pro: Ease of use and maximum compatibility with existing R API > * Con: We do not want to increase maintenance cost by opening arbitrary API. > With Spark's DataFrame API {{collect}} does not work on {{Column}} and there > is no need for it to work in R. > This ticket is strictly about {{head}}. I propose supporting {{head}} on > {{Column}} because: > 1. R users are already used to calling {{head(iris$Sepal.Length)}}. When they > do that on SparkDataFrame they get an error. Not a good experience > 2. Adding support for it does not require any change to the backend. It can > be trivially done in R code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17781) datetime is serialized as double inside dapply()
[ https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-17781: --- Description: When we ship a SparkDataFrame to workers for dapply family functions, inside the worker DateTime objects are serialized as double. To reproduce: {code} df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) dapplyCollect(df, function(x) { return(x$date) }) {code} was: When we ship a SparkDataFrame to workers for dapply family functions, inside the worker DateTime objects are serialized as double. To reproduce: {{code}} df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) dapplyCollect(df, function( x ) { return(x$date) }) {{code}} > datetime is serialized as double inside dapply() > > > Key: SPARK-17781 > URL: https://issues.apache.org/jira/browse/SPARK-17781 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > When we ship a SparkDataFrame to workers for dapply family functions, inside > the worker DateTime objects are serialized as double. > To reproduce: > {code} > df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) > dapplyCollect(df, function(x) { return(x$date) }) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17781) datetime is serialized as double inside dapply()
[ https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-17781: --- Affects Version/s: 2.0.0 > datetime is serialized as double inside dapply() > > > Key: SPARK-17781 > URL: https://issues.apache.org/jira/browse/SPARK-17781 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > When we ship a SparkDataFrame to workers for dapply family functions, inside > the worker DateTime objects are serialized as double. > To reproduce: > {code} > df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) > dapplyCollect(df, function(x) { return(x$date) }) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17781) datetime is serialized as double inside dapply()
[ https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-17781: --- Affects Version/s: (was: 2.0.1) > datetime is serialized as double inside dapply() > > > Key: SPARK-17781 > URL: https://issues.apache.org/jira/browse/SPARK-17781 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > When we ship a SparkDataFrame to workers for dapply family functions, inside > the worker DateTime objects are serialized as double. > To reproduce: > {code} > df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) > dapplyCollect(df, function(x) { return(x$date) }) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17781) datetime is serialized as double inside dapply()
[ https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-17781: --- Description: When we ship a SparkDataFrame to workers for dapply family functions, inside the worker DateTime objects are serialized as double. To reproduce: {{code}} df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) dapplyCollect(df, function( x ) { return(x$date) }) {{code}} was: When we ship a SparkDataFrame to workers for dapply family functions, inside the worker DateTime objects are serialized as double. To reproduce: {{code}} df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) dapplyCollect(df, function(x) { return(x$date) }) {{code}} > datetime is serialized as double inside dapply() > > > Key: SPARK-17781 > URL: https://issues.apache.org/jira/browse/SPARK-17781 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > When we ship a SparkDataFrame to workers for dapply family functions, inside > the worker DateTime objects are serialized as double. > To reproduce: > {{code}} > df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) > dapplyCollect(df, function( x ) { return(x$date) }) > {{code}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17781) datetime is serialized as double inside dapply()
[ https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-17781: --- Description: When we ship a SparkDataFrame to workers for dapply family functions, inside the worker DateTime objects are serialized as double. To reproduce: {{code}} df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) dapplyCollect(df, function( x ) { return(x$date) }) {{code}} was: When we ship a SparkDataFrame to workers for dapply family functions, inside the worker DateTime objects are serialized as double. To reproduce: {{code}} df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) dapplyCollect(df, function( x ) { return(x$date) }) {{code}} > datetime is serialized as double inside dapply() > > > Key: SPARK-17781 > URL: https://issues.apache.org/jira/browse/SPARK-17781 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > When we ship a SparkDataFrame to workers for dapply family functions, inside > the worker DateTime objects are serialized as double. > To reproduce: > {{code}} > df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) > dapplyCollect(df, function( x ) { > return(x$date) > }) > {{code}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17774) Add support for head on DataFrame Column
[ https://issues.apache.org/jira/browse/SPARK-17774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547232#comment-15547232 ] Hossein Falaki commented on SPARK-17774: Putting implementation aside, throwing an error for {{head(df$col)}} is bad user experience. For all the corner cases where the user calls {{head}} on a column that does not belong to any DataFrame we can throw an appropriate error. Before I submit a PR, I would like to get consensus here. Also, I think {{head}} is more important than {{collect}} because {{collect}} is not an existing R function. CC [~marmbrus] [~sunrui] [~felixcheung] [~olarayej] > Add support for head on DataFrame Column > > > Key: SPARK-17774 > URL: https://issues.apache.org/jira/browse/SPARK-17774 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > There was a lot of discussion on SPARK-9325. To summarize the conversation on > that ticket regarding {{collect}} > * Pro: Ease of use and maximum compatibility with existing R API > * Con: We do not want to increase maintenance cost by opening arbitrary API. > With Spark's DataFrame API {{collect}} does not work on {{Column}} and there > is no need for it to work in R. > This ticket is strictly about {{head}}. I propose supporting {{head}} on > {{Column}} because: > 1. R users are already used to calling {{head(iris$Sepal.Length)}}. When they > do that on SparkDataFrame they get an error. Not a good experience > 2. Adding support for it does not require any change to the backend. It can > be trivially done in R code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17781) datetime is serialized as double inside dapply()
Hossein Falaki created SPARK-17781: -- Summary: datetime is serialized as double inside dapply() Key: SPARK-17781 URL: https://issues.apache.org/jira/browse/SPARK-17781 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.0.1 Reporter: Hossein Falaki When we ship a SparkDataFrame to workers for dapply family functions, inside the worker DateTime objects are serialized as double. To reproduce: {{code}} df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) dapplyCollect(df, function(x) { return(x$date) }) {{code}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17774) Add support for head on DataFrame Column
[ https://issues.apache.org/jira/browse/SPARK-17774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-17774: --- Description: There was a lot of discussion on SPARK-9325. To summarize the conversation on that ticket regarding {{collect}} * Pro: Ease of use and maximum compatibility with existing R API * Con: We do not want to increase maintenance cost by opening arbitrary API. With Spark's DataFrame API {{collect}} does not work on {{Column}} and there is no need for it to work in R. This ticket is strictly about {{head}}. I propose supporting {{head}} on {{Column}} because: 1. R users are already used to calling {{head(iris$Sepal.Length)}}. When they do that on SparkDataFrame they get an error. Not a good experience 2. Adding support for it does not require any change to the backend. It can be trivially done in R code. was: There was a lot of discussion on SPARK-9325. To summarize the conversation on that ticket regardign {{collect}} * Pro: Ease of use and maximum compatibility with existing R API * Con: We do not want to increase maintenance cost by opening arbitrary API. With Spark's DataFrame API {{collect}} does not work on {{Column}} and there is no need for it to work in R. This ticket is strictly about {{head}}. I propose suporting {{head}} on {{Column}} because: 1. R users are already used to calling {{head(iris$Sepal.Length}}. When they do that on SparkDataFrame they get an error. Not a good experience 2. Adding support for it does not require any change to the backend. It can be trivially done in R code. > Add support for head on DataFrame Column > > > Key: SPARK-17774 > URL: https://issues.apache.org/jira/browse/SPARK-17774 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > There was a lot of discussion on SPARK-9325. To summarize the conversation on > that ticket regarding {{collect}} > * Pro: Ease of use and maximum compatibility with existing R API > * Con: We do not want to increase maintenance cost by opening arbitrary API. > With Spark's DataFrame API {{collect}} does not work on {{Column}} and there > is no need for it to work in R. > This ticket is strictly about {{head}}. I propose supporting {{head}} on > {{Column}} because: > 1. R users are already used to calling {{head(iris$Sepal.Length)}}. When they > do that on SparkDataFrame they get an error. Not a good experience > 2. Adding support for it does not require any change to the backend. It can > be trivially done in R code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17774) Add support for head on DataFrame Column
[ https://issues.apache.org/jira/browse/SPARK-17774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-17774: --- Description: There was a lot of discussion on SPARK-9325. To summarize the conversation on that ticket regardign {{collect}} * Pro: Ease of use and maximum compatibility with existing R API * Con: We do not want to increase maintenance cost by opening arbitrary API. With Spark's DataFrame API {{collect}} does not work on {{Column}} and there is no need for it to work in R. This ticket is strictly about {{head}}. I propose suporting {{head}} on {{Column}} because: 1. R users are already used to calling {{head(iris$Sepal.Length}}. When they do that on SparkDataFrame they get an error. Not a good experience 2. Adding support for it does not require any change to the backend. It can be trivially done in R code. > Add support for head on DataFrame Column > > > Key: SPARK-17774 > URL: https://issues.apache.org/jira/browse/SPARK-17774 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > There was a lot of discussion on SPARK-9325. To summarize the conversation on > that ticket regardign {{collect}} > * Pro: Ease of use and maximum compatibility with existing R API > * Con: We do not want to increase maintenance cost by opening arbitrary API. > With Spark's DataFrame API {{collect}} does not work on {{Column}} and there > is no need for it to work in R. > This ticket is strictly about {{head}}. I propose suporting {{head}} on > {{Column}} because: > 1. R users are already used to calling {{head(iris$Sepal.Length}}. When they > do that on SparkDataFrame they get an error. Not a good experience > 2. Adding support for it does not require any change to the backend. It can > be trivially done in R code.
[jira] [Created] (SPARK-17774) Add support for head on DataFrame Column
Hossein Falaki created SPARK-17774: -- Summary: Add support for head on DataFrame Column Key: SPARK-17774 URL: https://issues.apache.org/jira/browse/SPARK-17774 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 2.0.0 Reporter: Hossein Falaki
[jira] [Updated] (SPARK-17762) invokeJava fails when serialized argument list is larger than INT_MAX (2,147,483,647) bytes
[ https://issues.apache.org/jira/browse/SPARK-17762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-17762: --- Summary: invokeJava fails when serialized argument list is larger than INT_MAX (2,147,483,647) bytes (was: invokeJava serialized argument list is larger than INT_MAX (2,147,483,647) bytes) > invokeJava fails when serialized argument list is larger than INT_MAX > (2,147,483,647) bytes > --- > > Key: SPARK-17762 > URL: https://issues.apache.org/jira/browse/SPARK-17762 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > We call {{writeBin}} within {{writeRaw}}, which is called from invokeJava on > the serialized arguments list. Unfortunately, {{writeBin}} has a hard-coded > limit set to {{R_LEN_T_MAX}} (which is itself set to {{INT_MAX}} in base). > To work around it, we can check for this case and serialize the batch in > multiple parts.
[jira] [Created] (SPARK-17762) invokeJava serialized argument list is larger than INT_MAX (2,147,483,647) bytes
Hossein Falaki created SPARK-17762: -- Summary: invokeJava serialized argument list is larger than INT_MAX (2,147,483,647) bytes Key: SPARK-17762 URL: https://issues.apache.org/jira/browse/SPARK-17762 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.0.0 Reporter: Hossein Falaki We call {{writeBin}} within {{writeRaw}}, which is called from invokeJava on the serialized arguments list. Unfortunately, {{writeBin}} has a hard-coded limit set to {{R_LEN_T_MAX}} (which is itself set to {{INT_MAX}} in base). To work around it, we can check for this case and serialize the batch in multiple parts.
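The multi-part workaround described above can be sketched in R. This is an illustrative sketch only, not SparkR's actual {{writeRaw}} implementation; the function name writeRawChunked and the chunkSize default are assumptions for the example:
{code:r}
# Illustrative sketch (assumed helper, not SparkR's actual code): split a raw
# vector into chunks below writeBin's INT_MAX limit and write each part.
writeRawChunked <- function(con, bytes, chunkSize = 2^30) {
  total <- length(bytes)
  offset <- 1
  while (offset <= total) {
    end <- min(offset + chunkSize - 1, total)
    writeBin(bytes[offset:end], con)   # each call stays under the limit
    offset <- end + 1
  }
}
{code}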
[jira] [Updated] (SPARK-17442) Additional arguments in write.df are not passed to data source
[ https://issues.apache.org/jira/browse/SPARK-17442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-17442: --- Target Version/s: 2.0.1 > Additional arguments in write.df are not passed to data source > -- > > Key: SPARK-17442 > URL: https://issues.apache.org/jira/browse/SPARK-17442 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki >Priority: Blocker > > {{write.df}} passes all of its extra arguments to the underlying data source in > 1.x, but it does not pass header = "true" in Spark 2.0. For example, the > following code snippet produces a header line in older versions of Spark but > not in 2.0. > {code} > df <- createDataFrame(iris) > write.df(df, source = "com.databricks.spark.csv", path = "/tmp/iris", header > = "true") > {code}
[jira] [Updated] (SPARK-17442) Additional arguments in write.df are not passed to data source
[ https://issues.apache.org/jira/browse/SPARK-17442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-17442: --- Priority: Blocker (was: Critical) > Additional arguments in write.df are not passed to data source > -- > > Key: SPARK-17442 > URL: https://issues.apache.org/jira/browse/SPARK-17442 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki >Priority: Blocker > > {{write.df}} passes all of its extra arguments to the underlying data source in > 1.x, but it does not pass header = "true" in Spark 2.0. For example, the > following code snippet produces a header line in older versions of Spark but > not in 2.0. > {code} > df <- createDataFrame(iris) > write.df(df, source = "com.databricks.spark.csv", path = "/tmp/iris", header > = "true") > {code}
[jira] [Updated] (SPARK-17442) Additional arguments in write.df are not passed to data source
[ https://issues.apache.org/jira/browse/SPARK-17442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-17442: --- Priority: Critical (was: Major) > Additional arguments in write.df are not passed to data source > -- > > Key: SPARK-17442 > URL: https://issues.apache.org/jira/browse/SPARK-17442 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki >Priority: Critical > > {{write.df}} passes all of its extra arguments to the underlying data source in > 1.x, but it does not pass header = "true" in Spark 2.0. For example, the > following code snippet produces a header line in older versions of Spark but > not in 2.0. > {code} > df <- createDataFrame(iris) > write.df(df, source = "com.databricks.spark.csv", path = "/tmp/iris", header > = "true") > {code}
[jira] [Created] (SPARK-17442) Additional arguments in write.df are not passed to data source
Hossein Falaki created SPARK-17442: -- Summary: Additional arguments in write.df are not passed to data source Key: SPARK-17442 URL: https://issues.apache.org/jira/browse/SPARK-17442 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.0.0 Reporter: Hossein Falaki {{write.df}} passes all of its extra arguments to the underlying data source in 1.x, but it does not pass header = "true" in Spark 2.0. For example, the following code snippet produces a header line in older versions of Spark but not in 2.0. {code} df <- createDataFrame(iris) write.df(df, source = "com.databricks.spark.csv", path = "/tmp/iris", header = "true") {code}
[jira] [Commented] (SPARK-16883) SQL decimal type is not properly cast to number when collecting SparkDataFrame
[ https://issues.apache.org/jira/browse/SPARK-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410105#comment-15410105 ] Hossein Falaki commented on SPARK-16883: Thanks [~shivaram]! This may require changing the serialization. Who thought this might happen! ;) > SQL decimal type is not properly cast to number when collecting SparkDataFrame > -- > > Key: SPARK-16883 > URL: https://issues.apache.org/jira/browse/SPARK-16883 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > To reproduce run following code. As you can see "y" is a list of values. > {code} > registerTempTable(createDataFrame(iris), "iris") > str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y > from iris limit 5"))) > 'data.frame': 5 obs. of 2 variables: > $ x: num 1 1 1 1 1 > $ y:List of 5 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > {code}
[jira] [Commented] (SPARK-16896) Loading csv with duplicate column names
[ https://issues.apache.org/jira/browse/SPARK-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408728#comment-15408728 ] Hossein Falaki commented on SPARK-16896: I suggest we generally follow the restrictions of SparkSQL, which does not accept duplicate column names. So I agree with numbering column names if there are duplicates. > Loading csv with duplicate column names > --- > > Key: SPARK-16896 > URL: https://issues.apache.org/jira/browse/SPARK-16896 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > It would be great if the library allowed us to load a CSV with duplicate column > names. I understand that having duplicate columns in the data is odd, but > sometimes we get upstream data that has duplicate columns; that can happen. We may > choose to ignore those columns, but currently there is no way to drop them because > we cannot load the file at all. Currently, as a > pre-processing step, I load the data into R, change the column names, and then > make a fixed version with which the Spark Java API can work. > As for other options: R's read.csv automatically > takes care of this situation by appending a number to the column name. > Case sensitivity in column names can also cause problems. If we > have columns like > ColumnName and columnName, > I may want to keep them separate, but the option to do this is not > documented.
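For reference, the base R behavior mentioned in the report above (read.csv appending a number to duplicate column names) comes from make.unique, which read.csv applies through its default check.names = TRUE:
{code:r}
# Base R deduplicates column names with make.unique, appending ".1", ".2", ...
make.unique(c("price", "price", "volume"))
# [1] "price"   "price.1" "volume"
{code}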
[jira] [Commented] (SPARK-16903) nullValue in first field is not respected by CSV source when read
[ https://issues.apache.org/jira/browse/SPARK-16903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408722#comment-15408722 ] Hossein Falaki commented on SPARK-16903: Thanks for the info. That makes me doubt the decision to single out StringType. What is your opinion? > nullValue in first field is not respected by CSV source when read > - > > Key: SPARK-16903 > URL: https://issues.apache.org/jira/browse/SPARK-16903 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > file: > {code} > a,- > -,10 > {code} > Query: > {code} > create temporary table test(key string, val decimal) > using com.databricks.spark.csv > options (path "/tmp/hossein2/null.csv", header "false", delimiter ",", > nullValue "-"); > {code} > Result: > {code} > select count(*) from test where key is null > 0 > {code} > But > {code} > select count(*) from test where val is null > 1 > {code}
[jira] [Commented] (SPARK-16883) SQL decimal type is not properly cast to number when collecting SparkDataFrame
[ https://issues.apache.org/jira/browse/SPARK-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408621#comment-15408621 ] Hossein Falaki commented on SPARK-16883: I think that is because we are not converting {{decimal}} to {{numeric}} properly. In {{pkg/R/types.R}}, decimal is supposed to translate to numeric. > SQL decimal type is not properly cast to number when collecting SparkDataFrame > -- > > Key: SPARK-16883 > URL: https://issues.apache.org/jira/browse/SPARK-16883 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > To reproduce run following code. As you can see "y" is a list of values. > {code} > registerTempTable(createDataFrame(iris), "iris") > str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y > from iris limit 5"))) > 'data.frame': 5 obs. of 2 variables: > $ x: num 1 1 1 1 1 > $ y:List of 5 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > {code}
[jira] [Created] (SPARK-16903) nullValue in first field is not respected by CSV source when read
Hossein Falaki created SPARK-16903: -- Summary: nullValue in first field is not respected by CSV source when read Key: SPARK-16903 URL: https://issues.apache.org/jira/browse/SPARK-16903 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Hossein Falaki file: {code} a,- -,10 {code} Query: {code} create temporary table test(key string, val decimal) using com.databricks.spark.csv options (path "/tmp/hossein2/null.csv", header "false", delimiter ",", nullValue "-"); {code} Result: {code} select count(*) from test where key is null 0 {code} But {code} select count(*) from test where val is null 1 {code}