[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16512





[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-13 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r96032164
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), 
c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))),
list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

Thanks so much for digging into this! I kind of like the behavior of
`positions` (distributing things evenly, etc.) and also the fact that we will be
maintaining behavior similar to Scala with this new function. And the `lapply`
code fragment is not that much more complex, imho, in terms of readability.

So my take is let's go for it? We can also add a comment saying we are
mimicking `positions`, in case somebody wonders why we didn't use a one-liner.





[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-12 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r95939102
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), 
c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))),
list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

And so if this subtlety is significant, we could change to this. It's a
slightly more involved change, but it would match Scala exactly.

```
splits <-
  unlist(lapply(0:(numSlices - 1), function(x) {
    start <- trunc((x * length) / numSlices)
    end <- trunc(((x + 1) * length) / numSlices)
    rep(start, end - start)
  }))
```

And you get this sequence for length <- 50, numSlices <- 22
```
 [1]  0  0  2  2  4  4  6  6  6  9  9 11 11 13 13 15 15 15 18 18 20 20 22 22 22
[26] 25 25 27 27 29 29 31 31 31 34 34 36 36 38 38 40 40 40 43 43 45 45 47 47 47
```

When split() is called with this sequence, it is treated as a factor (via
as.factor), so the numeric values themselves are not significant.
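
For illustration, here's a minimal sketch (hypothetical helper name, not part of
the patch) of how such a splits vector could drive the partitioning via split():

```
# Sketch only: a Scala-positions-style splits vector fed to split().
# split() treats the vector as a factor, so only the grouping matters.
sliceCollection <- function(coll, numSlices) {
  len <- length(coll)
  splits <- unlist(lapply(0:(numSlices - 1), function(x) {
    start <- trunc((x * len) / numSlices)
    end <- trunc(((x + 1) * len) / numSlices)
    rep(start, end - start)
  }))
  split(coll, splits)
}

slices <- sliceCollection(as.list(1:10), 3)
lengths(slices)  # group sizes come out roughly even: 3 3 4
```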






[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-12 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r95938844
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), 
c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))),
list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

Oops, I thought we were talking about `numSlices`. Great point about
`positions`; here's what I'm seeing (it's going to be a bit long).

```
positions(50, 20)
(0,2)   0 1
(2,5)   2 3 4
(5,7)   5 6 
(7,10)  7 8 9
(10,12) 10 11
(12,15) 12 13 14
(15,17) 15 16
(17,20) 17 18 19
(20,22) 20 21
(22,25) 22 23 24
(25,27) 25 26
(27,30) 27 28 29
(30,32) 30 31
(32,35) 32 33 34
(35,37) 35 36 
(37,40) 37 38 39
(40,42) 40 41
(42,45) 42 43 44
(45,47) 45 46
(47,50) 47 48 49 

sort(rep(1: 20, each = 1, length.out = 50))
 [1]  1  1  1  2  2  2  3  3  3  4  4  4  5  5  5  6  6  6  7  7  7  8  8  8  9
[26]  9  9 10 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20 20
```

As you can see, `positions` attempts to evenly distribute the "extras".

```
positions(50, 24)
(0,2)   0 1
(2,4)   2 3
(4,6)   4 5
(6,8)   6 7
(8,10)  8 9
(10,12) 10 11
(12,14) 12 13
(14,16) 14 15
(16,18) 16 17
(18,20) 18 19
(20,22) 20 21
(22,25) 22 23 24
(25,27) 25 26
(27,29) 27 28
(29,31) 29 30
(31,33) 31 32
(33,35) 33 34
(35,37) 35 36
(37,39) 37 38
(39,41) 39 40
(41,43) 41 42
(43,45) 43 44
(45,47) 45 46
(47,50) 47 48 49

 sort(rep(1: 24, each = 1, length.out = 50))
 [1]  1  1  1  2  2  2  3  3  4  4  5  5  6  6  7  7  8  8  9  9 10 10 11 11 12
[26] 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20 20 21 21 22 22 23 23 24 24
```

You can see that when there are only 2 extras, it puts one in the middle and one at the end.

```
positions(50, 22)
(0,2)   0 1
(2,4)   2 3
(4,6)   4 5
(6,9)   6 7 8
(9,11)  9 10
(11,13) 11 12
(13,15) 13 14
(15,18) 15 16 17
(18,20) 18 19
(20,22) 20 21
(22,25) 22 23 24
(25,27) 25 26
(27,29) 27 28
(29,31) 29 30
(31,34) 31 32 33
(34,36) 34 35
(36,38) 36 37
(38,40) 38 39
(40,43) 40 41 42
(43,45) 43 44
(45,47) 45 46
(47,50) 47 48 49

 sort(rep(1: 22, each = 1, length.out = 50))
 [1]  1  1  1  2  2  2  3  3  3  4  4  4  5  5  5  6  6  6  7  7  8  8  9  9 10
[26] 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20 20 21 21 22 22
```

When there are only a few extras, they are still roughly evenly spaced out.
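
For comparison, here's a small R sketch (hypothetical, just mirroring the Scala
`positions` logic rather than any SparkR code) that reproduces the (start, end)
ranges listed above:

```
# Sketch only: an R rendering of ParallelCollectionRDD's positions logic,
# used here just to regenerate the (start, end) ranges shown above.
positions <- function(length, numSlices) {
  lapply(0:(numSlices - 1), function(i) {
    start <- trunc((i * length) / numSlices)
    end <- trunc(((i + 1) * length) / numSlices)
    c(start, end)  # half-open range [start, end)
  })
}

positions(50, 22)[[4]]  # c(6, 9), i.e. elements 6 7 8, matching the (6,9) row above
```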






[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-12 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r95909368
  
--- Diff: R/pkg/R/context.R ---
@@ -91,6 +91,13 @@ objectFile <- function(sc, path, minPartitions = NULL) {
 #' will write it to disk and send the file name to JVM. Also to make sure 
each slice is not
 #' larger than that limit, number of slices may be increased.
 #'
+#' In 2.2.0 we are changing how the numSlices are used/computed to handle
+#' 1 < (length(coll) / numSlices) << length(coll) better. This is safe 
because
+#' parallelize() is not exposed publically. In the specific one case that 
it is used to convert
+#' R native object into SparkDataFrame, it has always been keeping it at 
the default of 1.
--- End diff --

nit: keeping it -> kept





[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-12 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r95909985
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), 
c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))),
list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

Oh I was not referring to computing `numSlices` but commenting on how the 
`positions` function was used to actually perform the split in 
https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L123
 





[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-12 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r95910083
  
--- Diff: R/pkg/R/context.R ---
@@ -91,6 +91,13 @@ objectFile <- function(sc, path, minPartitions = NULL) {
 #' will write it to disk and send the file name to JVM. Also to make sure 
each slice is not
 #' larger than that limit, number of slices may be increased.
 #'
+#' In 2.2.0 we are changing how the numSlices are used/computed to handle
+#' 1 < (length(coll) / numSlices) << length(coll) better. This is safe 
because
+#' parallelize() is not exposed publically. In the specific one case that 
it is used to convert
--- End diff --

nit: spelling of publicly





[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-12 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r95909536
  
--- Diff: R/pkg/R/context.R ---
@@ -128,12 +135,15 @@ parallelize <- function(sc, coll, numSlices = 1) {
   objectSize <- object.size(coll)
 
   # For large objects we make sure the size of each slice is also smaller 
than sizeLimit
-  numSlices <- max(numSlices, ceiling(objectSize / sizeLimit))
-  if (numSlices > length(coll))
-numSlices <- length(coll)
+  numSerializedSlices <- max(numSlices, ceiling(objectSize / sizeLimit))
+  if (numSerializedSlices > length(coll))
+numSerializedSlices <- length(coll)
+
+  splits <- sort(rep(1: numSerializedSlices, each = 1, length.out = length(coll)))
--- End diff --

Can we add a small comment here on what the splits come out as (something
like what you had in the PR comment)?
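
For reference, here's a quick illustration (not part of the patch) of what that
`splits` vector comes out as for a small collection:

```
# Illustration only: 10 elements, 3 serialized slices.
numSerializedSlices <- 3
coll <- as.list(1:10)
splits <- sort(rep(1:numSerializedSlices, each = 1, length.out = length(coll)))
splits
# [1] 1 1 1 1 2 2 2 3 3 3
```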





[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-12 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r95874458
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), 
c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))),
list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

hmm, I'm not finding any automatic assignment to numSlices in Python or
ParallelCollectionRDD
([here](https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L118))

I see [defaultParallelism](https://github.com/apache/spark/blob/31da755c80aed8219c368fd18c72b42e50be46fc/core/src/main/scala/org/apache/spark/SparkContext.scala#L2272).
I thought about being smarter than fixing the default partition number at 1, but
a) it's been this way since Spark 1.3
b) most R data loaded from the driver is small anyway, so it might not be
worthwhile to have more than 1 partition





[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-12 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r95865135
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), 
c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))),
list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

Can we check how the Scala implementation works for this? I am wondering
if we can reuse the logic there.





[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-11 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r95738253
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), 
c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))),
list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

thinking more about this, I think it's preferable to set each = 1 and
length.out to let it repeat to fill it out, as opposed to the current behavior
of each = ceiling of some multiplier and then truncating it.
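
To make the difference concrete, a small comparison (illustrative values only;
the "ceiling" variant is a sketch of the behavior described above, not a quote of
the existing code):

```
n <- 10          # length of the collection
numSlices <- 3

# each = ceiling of a multiplier, then truncate to n elements
old_style <- rep(1:numSlices, each = ceiling(n / numSlices))[1:n]
old_style  # 1 1 1 1 2 2 2 2 3 3  -> slice sizes 4 4 2, last slice shortchanged

# each = 1 with length.out, then sort
new_style <- sort(rep(1:numSlices, each = 1, length.out = n))
new_style  # 1 1 1 1 2 2 2 3 3 3  -> slice sizes 4 3 3, differ by at most 1
```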






[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-10 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r95518300
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), 
c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))),
list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

How about adding in the doc:
"the actual number of partitions can be increased as multiples of
spark.r.maxAllocationLimit, or limited by the number of columns in the data.frame."





[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-10 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r95514830
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), 
c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))),
list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

it looks like it is intentional:
https://github.com/apache/spark/blob/master/R/pkg/R/context.R#L115






[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-10 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r95514721
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -186,6 +186,8 @@ getDefaultSqlSource <- function() {
 #'
 #' @param data an RDD or list or data.frame.
 #' @param schema a list of column names or named list (StructType), 
optional.
+#' @param samplingRatio Currently not used.
+#' @param numPartitions the number of partitions of the SparkDataFrame.
--- End diff --

more in the other comment...





[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-10 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r95514615
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), 
c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))),
list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

Hmm, good point, the behavior is a bit strange and I haven't thought of a
concise way to document this.
https://github.com/apache/spark/blob/master/R/pkg/R/context.R#L131

Basically, it's the larger of numSlices or the ceiling of the data size divided by
`spark.r.maxAllocationLimit` - *but* capped by the length of the data (and this
length is wrong if the data is a data.frame, since that length becomes the
number of columns).

Is this unintentional behavior (i.e. always limited by the number of columns,
even when the data size is larger than `spark.r.maxAllocationLimit`)? I can't tell...
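
As a rough sketch of the computation being described (paraphrasing the linked
context.R logic, not quoting it; sizeLimit stands in for spark.r.maxAllocationLimit):

```
# Rough paraphrase of the behavior described above, not the actual context.R code.
computeNumSlices <- function(coll, numSlices, sizeLimit) {
  objectSize <- object.size(coll)
  n <- max(numSlices, ceiling(objectSize / sizeLimit))
  # length(coll) is the number of columns when coll is a data.frame,
  # which is why the cap below can end up surprisingly small.
  min(n, length(coll))
}
```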






[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-10 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r95487087
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -186,6 +186,8 @@ getDefaultSqlSource <- function() {
 #'
 #' @param data an RDD or list or data.frame.
--- End diff --

While we are at it, can we remove the `RDD` bit from this line?





[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-10 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r95487308
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -186,6 +186,8 @@ getDefaultSqlSource <- function() {
 #'
 #' @param data an RDD or list or data.frame.
 #' @param schema a list of column names or named list (StructType), 
optional.
+#' @param samplingRatio Currently not used.
+#' @param numPartitions the number of partitions of the SparkDataFrame.
--- End diff --

Related to the comment in the test case, it might be good to say what the
default is here?





[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-10 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/16512#discussion_r95487221
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -196,6 +196,12 @@ test_that("create DataFrame from RDD", {
   expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), 
c("height", "float")))
   expect_equal(as.list(collect(where(df, df$name == "John"))),
list(name = "John", age = 19L, height = 176.5))
+  expect_equal(getNumPartitions(toRDD(df)), 1)
--- End diff --

Is the default always 1, or does it depend on the machine used? I vaguely
remember the default being inferred from the number of cores available or
something like that.





[GitHub] spark pull request #16512: [SPARK-18335][SPARKR] createDataFrame to support ...

2017-01-08 Thread felixcheung
GitHub user felixcheung opened a pull request:

https://github.com/apache/spark/pull/16512

[SPARK-18335][SPARKR] createDataFrame to support numPartitions parameter

## What changes were proposed in this pull request?

To allow specifying the number of partitions when the DataFrame is created.
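
For example (sketch of the intended usage; `toRDD()`/`getNumPartitions()` are
used here only because that is how the PR's tests observe the partitioning):

```
# Hypothetical usage sketch once this PR is merged.
df <- createDataFrame(mtcars, numPartitions = 4)
getNumPartitions(toRDD(df))  # should reflect the requested numPartitions
```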

## How was this patch tested?

manual, unit tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/felixcheung/spark rnumpart

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16512.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16512


commit b66a0ac4748bbf14dfb992aeff95028122b6d7a9
Author: Felix Cheung 
Date:   2017-01-09T05:16:39Z

add numPartitions



