[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16512 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r96032164

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))), list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

Thanks so much for digging into this! I kind of like the behavior of `positions` (distributing things evenly, etc.) and also the fact that we would be maintaining behavior similar to Scala with this new function. And the `lapply` code fragment is not that much more complex, IMHO, in terms of readability. So my take is: let's go for it? We can also add a comment saying we are mimicking `positions`, in case somebody wonders why we didn't use a one-liner.
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r95939102

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))), list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

And so, if this subtlety is significant, we could change to this. It's a slightly more involved change, but it would match Scala exactly.

```
splits <- unlist(lapply(0:(numSlices - 1), function(x) {
  start <- trunc((x * length) / numSlices)
  end <- trunc(((x + 1) * length) / numSlices)
  rep(start, end - start)
}))
```

And you get this sequence for length <- 50, numSlices <- 22:

```
 [1]  0  0  2  2  4  4  6  6  6  9  9 11 11 13 13 15 15 15 18 18 20 20 22 22 22
[26] 25 25 27 27 29 29 31 31 31 34 34 36 36 38 38 40 40 40 43 43 45 45 47 47 47
```

When split() is called with this sequence, it is coerced with as.factor, so the actual numeric values are not significant; only which elements share a value matters.
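For readers comparing the two approaches outside of R, here is an illustrative Python translation of the snippet above (the function name and structure are mine, not SparkR's; integer floor division plays the role of trunc() for positive values):

```python
def split_ids(length, num_slices):
    """Per-element split ids, mirroring the R snippet above: for each
    slice x, the slice's start index is repeated (end - start) times."""
    ids = []
    for x in range(num_slices):
        start = (x * length) // num_slices
        end = ((x + 1) * length) // num_slices
        ids.extend([start] * (end - start))
    return ids

print(split_ids(50, 22)[:12])  # [0, 0, 2, 2, 4, 4, 6, 6, 6, 9, 9, 11]
```

This reproduces the sequence shown above: one distinct id per slice, with the "extra" elements spread evenly across the slices.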
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r95938844

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))), list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

Oops, I thought we were talking about `numSlices`. Great point about `positions`, and here's what I'm seeing (it's going to be a bit long):

```
positions(50, 20)
(0,2) 0 1
(2,5) 2 3 4
(5,7) 5 6
(7,10) 7 8 9
(10,12) 10 11
(12,15) 12 13 14
(15,17) 15 16
(17,20) 17 18 19
(20,22) 20 21
(22,25) 22 23 24
(25,27) 25 26
(27,30) 27 28 29
(30,32) 30 31
(32,35) 32 33 34
(35,37) 35 36
(37,40) 37 38 39
(40,42) 40 41
(42,45) 42 43 44
(45,47) 45 46
(47,50) 47 48 49

sort(rep(1:20, each = 1, length.out = 50))
 [1]  1  1  1  2  2  2  3  3  3  4  4  4  5  5  5  6  6  6  7  7  7  8  8  8  9
[26]  9  9 10 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20 20
```

As you can see, `positions` attempts to evenly distribute the "extras".

```
positions(50, 24)
(0,2) 0 1
(2,4) 2 3
(4,6) 4 5
(6,8) 6 7
(8,10) 8 9
(10,12) 10 11
(12,14) 12 13
(14,16) 14 15
(16,18) 16 17
(18,20) 18 19
(20,22) 20 21
(22,25) 22 23 24
(25,27) 25 26
(27,29) 27 28
(29,31) 29 30
(31,33) 31 32
(33,35) 33 34
(35,37) 35 36
(37,39) 37 38
(39,41) 39 40
(41,43) 41 42
(43,45) 43 44
(45,47) 45 46
(47,50) 47 48 49

sort(rep(1:24, each = 1, length.out = 50))
 [1]  1  1  1  2  2  2  3  3  4  4  5  5  6  6  7  7  8  8  9  9 10 10 11 11 12
[26] 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20 20 21 21 22 22 23 23 24 24
```

You see that when there are only 2 extras, it puts one in the middle and one at the end.

```
positions(50, 22)
(0,2) 0 1
(2,4) 2 3
(4,6) 4 5
(6,9) 6 7 8
(9,11) 9 10
(11,13) 11 12
(13,15) 13 14
(15,18) 15 16 17
(18,20) 18 19
(20,22) 20 21
(22,25) 22 23 24
(25,27) 25 26
(27,29) 27 28
(29,31) 29 30
(31,34) 31 32 33
(34,36) 34 35
(36,38) 36 37
(38,40) 38 39
(40,43) 40 41 42
(43,45) 43 44
(45,47) 45 46
(47,50) 47 48 49

sort(rep(1:22, each = 1, length.out = 50))
 [1]  1  1  1  2  2  2  3  3  3  4  4  4  5  5  5  6  6  6  7  7  8  8  9  9 10
[26] 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20 20 21 21 22 22
```

When there are only a few extras, they are still roughly evenly spaced out.
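The `positions` outputs above come from a small Scala helper. This is a hedged Python sketch of that logic (names approximate, not the actual Spark source), which shows why the chunk sizes differ by at most one element:

```python
def positions(length, num_slices):
    """(start, end) index ranges per slice, in the spirit of Scala's
    ParallelCollectionRDD positions helper: slice i covers
    [i * length / num_slices, (i + 1) * length / num_slices)."""
    return [((i * length) // num_slices, ((i + 1) * length) // num_slices)
            for i in range(num_slices)]

# The ranges cover every element exactly once, and because the boundaries
# are evenly spaced multiples of length / num_slices rounded down, sizes
# differ by at most 1.
print(positions(50, 22)[:4])  # [(0, 2), (2, 4), (4, 6), (6, 9)]
```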
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r95909368

--- Diff: R/pkg/R/context.R ---
@@ -91,6 +91,13 @@ objectFile <- function(sc, path, minPartitions = NULL) {
 #' will write it to disk and send the file name to JVM. Also to make sure each slice is not
 #' larger than that limit, number of slices may be increased.
 #'
+#' In 2.2.0 we are changing how the numSlices are used/computed to handle
+#' 1 < (length(coll) / numSlices) << length(coll) better. This is safe because
+#' parallelize() is not exposed publically. In the specific one case that it is used to convert
+#' R native object into SparkDataFrame, it has always been keeping it at the default of 1.
--- End diff --

nit: keeping it -> kept
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r95909985

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))), list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

Oh, I was not referring to computing `numSlices` but commenting on how the `positions` function was used to actually perform the split in https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L123
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r95910083

--- Diff: R/pkg/R/context.R ---
@@ -91,6 +91,13 @@ objectFile <- function(sc, path, minPartitions = NULL) {
 #' will write it to disk and send the file name to JVM. Also to make sure each slice is not
 #' larger than that limit, number of slices may be increased.
 #'
+#' In 2.2.0 we are changing how the numSlices are used/computed to handle
+#' 1 < (length(coll) / numSlices) << length(coll) better. This is safe because
+#' parallelize() is not exposed publically. In the specific one case that it is used to convert
--- End diff --

nit: spelling of publicly
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r95909536

--- Diff: R/pkg/R/context.R ---
@@ -128,12 +135,15 @@ parallelize <- function(sc, coll, numSlices = 1) {
   objectSize <- object.size(coll)

   # For large objects we make sure the size of each slice is also smaller than sizeLimit
-  numSlices <- max(numSlices, ceiling(objectSize / sizeLimit))
-  if (numSlices > length(coll))
-    numSlices <- length(coll)
+  numSerializedSlices <- max(numSlices, ceiling(objectSize / sizeLimit))
+  if (numSerializedSlices > length(coll))
+    numSerializedSlices <- length(coll)
+
+  splits <- sort(rep(1:numSerializedSlices, each = 1, length.out = length(coll)))
--- End diff --

Can we add a small comment here on what the splits come out as (something like what you had in the PR comment)?
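For anyone wondering what that `sort(rep(...))` line produces, a Python equivalent (a sketch, not the SparkR code itself) makes the shape of the assignment explicit:

```python
def splits(n_elems, n_slices):
    """Python sketch of R's sort(rep(1:n_slices, each = 1,
    length.out = n_elems)): cycle the slice ids 1..n_slices until
    n_elems values exist, then sort so each slice id labels a
    contiguous run of elements. Any leftover elements pile onto the
    lowest slice ids rather than being spread evenly."""
    return sorted((i % n_slices) + 1 for i in range(n_elems))

print(splits(50, 22)[:8])  # [1, 1, 1, 2, 2, 2, 3, 3]
```

Note the contrast with Scala's `positions`, which distributes the extras evenly; here slices 1 through 6 each get 3 elements and the rest get 2.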
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r95874458

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))), list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

hmm, I'm not finding any automatic assignment to numSlices in Python or ParallelCollectionRDD ([here](https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L118)). I see [defaultParallelism](https://github.com/apache/spark/blob/31da755c80aed8219c368fd18c72b42e50be46fc/core/src/main/scala/org/apache/spark/SparkContext.scala#L2272). I thought about being smarter than fixing the default partition number at 1, but a) it's been this way since Spark 1.3, and b) most R data loaded from the driver is small anyway, so it might not be worthwhile to have more than 1 partition.
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r95865135

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))), list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

Can we check how the Scala implementation works for this? I am wondering if we can reuse the logic there.
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r95738253

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))), list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

Thinking more about this, I think it's preferable to set each = 1 and let length.out repeat to fill it out, as opposed to the current behavior of setting each to the ceiling of some multiplier and then truncating.
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r95518300

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))), list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

How about adding this to the doc: "the actual number of partitions can be increased as multiples of spark.r.maxAllocationLimit, or limited by the number of columns in the data.frame."
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r95514830

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))), list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

it looks like it is intentional: https://github.com/apache/spark/blob/master/R/pkg/R/context.R#L115
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r95514721

--- Diff: R/pkg/R/SQLContext.R ---
@@ -186,6 +186,8 @@ getDefaultSqlSource <- function() {
 #'
 #' @param data an RDD or list or data.frame.
 #' @param schema a list of column names or named list (StructType), optional.
+#' @param samplingRatio Currently not used.
+#' @param numPartitions the number of partitions of the SparkDataFrame.
--- End diff --

more in the other comment...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r95514615

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))), list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

Hmm, good point, the behavior is a bit strange and I haven't thought of a concise way to document this. https://github.com/apache/spark/blob/master/R/pkg/R/context.R#L131

Basically, it's the largest of numSlices or the ceiling of the data size divided by `spark.r.maxAllocationLimit`, *but* limited by the length of the data (and this length is wrong if the data is a data.frame, since that length becomes the number of columns). Is this unintentional behavior (i.e. always limited by the number of columns, even when the data size is larger than `spark.r.maxAllocationLimit`)? I can't tell...
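To make the behavior described above concrete, here is an illustrative Python sketch of the slice-count computation (parameter names are mine; the real logic lives in R/pkg/R/context.R):

```python
import math

def num_serialized_slices(num_slices, object_size, size_limit, coll_len):
    """Take the larger of the requested slice count and the count needed
    to keep each serialized slice under size_limit, then cap the result
    at the collection length. For an R data.frame, length() is the
    number of columns, which is the surprising cap being discussed."""
    n = max(num_slices, math.ceil(object_size / size_limit))
    return min(n, coll_len)

# A 5 MB object with a 1 MB limit wants 5 slices, but a "length" of 3
# (e.g. a 3-column data.frame) caps the result at 3.
print(num_serialized_slices(1, 5_000_000, 1_000_000, 3))  # 3
```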
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r95487087

--- Diff: R/pkg/R/SQLContext.R ---
@@ -186,6 +186,8 @@ getDefaultSqlSource <- function() {
 #'
 #' @param data an RDD or list or data.frame.
--- End diff --

While we are at it, can we remove the `RDD` bit from this line?
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r95487308

--- Diff: R/pkg/R/SQLContext.R ---
@@ -186,6 +186,8 @@ getDefaultSqlSource <- function() {
 #'
 #' @param data an RDD or list or data.frame.
 #' @param schema a list of column names or named list (StructType), optional.
+#' @param samplingRatio Currently not used.
+#' @param numPartitions the number of partitions of the SparkDataFrame.
--- End diff --

Related to the comment in the test case, it might be good to say what the default is here?
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/16512#discussion_r95487221

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))), list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

Is the default always 1, or does it depend on the machine used? I vaguely remember the default being inferred from the number of cores available or something like that.
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/16512

[SPARK-18335][SPARKR] createDataFrame to support numPartitions parameter

## What changes were proposed in this pull request?

To allow specifying the number of partitions when the DataFrame is created.

## How was this patch tested?

Manual and unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rnumpart

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16512.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16512

commit b66a0ac4748bbf14dfb992aeff95028122b6d7a9
Author: Felix Cheung
Date: 2017-01-09T05:16:39Z

    add numPartitions