[jira] [Commented] (SPARK-11046) Pass schema from R to JVM using JSON format

2017-01-09 Thread Vicente Masip (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15812214#comment-15812214
 ] 

Vicente Masip commented on SPARK-11046:
---

I'd like to see this last feature implemented; it is badly needed for the 
reasons you explain here. What happens when you are working with 50 columns and 
gapply? Do I have to rewrite all 50 columns plus the new column produced by the 
gapply operation? I think there is no alternative. Any suggestions?
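For what it's worth, the wide-schema pain can be scripted around. Below is a rough sketch (Python, with made-up column names) of generating the JSON form of a 50-column struct schema programmatically instead of spelling out each structField by hand; this is an illustration of the idea, not SparkR syntax.

```python
# Hypothetical sketch: build a Spark-style JSON struct schema for many
# columns programmatically. Column names here are invented for illustration.
def make_struct_schema(columns):
    """columns: list of (name, spark_type_name) pairs."""
    return {
        "type": "struct",
        "fields": [
            {"name": name, "type": typ, "nullable": True, "metadata": {}}
            for name, typ in columns
        ],
    }

# 50 input columns plus one new column produced by the gapply operation.
cols = [("col%d" % i, "double") for i in range(50)] + [("result", "double")]
schema = make_struct_schema(cols)
```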

> Pass schema from R to JVM using JSON format
> ---
>
> Key: SPARK-11046
> URL: https://issues.apache.org/jira/browse/SPARK-11046
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Sun Rui
>Priority: Minor
>
> Currently, SparkR passes a DataFrame schema from R to the JVM backend using a 
> regular expression. However, Spark now supports schemas in JSON format, so 
> SparkR should be enhanced to pass the schema in JSON format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11046) Pass schema from R to JVM using JSON format

2016-10-27 Thread Sammie Durugo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613459#comment-15613459
 ] 

Sammie Durugo commented on SPARK-11046:
---

I'm not sure anyone has noticed that a nested schema cannot be passed to dapply, 
and that an array type cannot be declared the way "integer", "string", and 
"double" can when defining a schema using structType. It would be useful to be 
able to declare an array type when using dapply, since most R outputs take the 
form of an R list object. For example, suppose the R output takes the 
following form:

output = list(bd = array(..., dim = c(d1, d2, d3)),
              dd = matrix(..., nr, nc),
              cp = list(a = matrix(..., nr, nc),
                        b = vector(...)))

in order to define a schema to pass to dapply in the above context, one should 
have the liberty to define the schema with the following form (if possible):

schema = structType(structField("bd", "array"),
                    structField("dd", "array"),
                    structField("cp",
                                structType(structField("a", "array"),
                                           structField("b", "double"))))

which may look like this (if possible):

StructType
|-name = "bd", type = "ArrayType", nullable = TRUE
|-name = "dd", type = "ArrayType", nullable = TRUE
|-name = "cp", type = "StructType", nullable = TRUE
   |-name = "a", type = "ArrayType", nullable = TRUE
   |-name = "b", type = "double", nullable = TRUE

At the moment, only a character string is allowed for the data type parameter 
within structField. But by relaxing this condition and allowing a structType to 
be passed inside an existing structType, the above structure could be 
accommodated easily. It would also help to allow R list objects, which are 
close to R array objects by design, to be mapped onto Spark's ArrayType.
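For reference, the nested schema sketched above would map onto the JSON representation that Spark's DataType.fromJson accepts roughly as follows (a hand-written Python sketch; the "double" element types are assumptions for illustration, and "cp" is modeled as a struct rather than an array):

```python
# Sketch only: the desired nested schema, written as the JSON-style structure
# Spark's DataType.fromJson accepts. Element types are assumed for illustration.
schema = {
    "type": "struct",
    "fields": [
        {"name": "bd",
         "type": {"type": "array", "elementType": "double",
                  "containsNull": True},
         "nullable": True, "metadata": {}},
        {"name": "dd",
         "type": {"type": "array", "elementType": "double",
                  "containsNull": True},
         "nullable": True, "metadata": {}},
        {"name": "cp",
         "type": {"type": "struct", "fields": [
             {"name": "a",
              "type": {"type": "array", "elementType": "double",
                       "containsNull": True},
              "nullable": True, "metadata": {}},
             {"name": "b", "type": "double",
              "nullable": True, "metadata": {}},
         ]},
         "nullable": True, "metadata": {}},
    ],
}
```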

Having to fall back on the default setting of schema = NULL in dapply, which 
leaves the output as bytes, should be the very last resort. Thank you for your 
help with this.

Sammie.




[jira] [Commented] (SPARK-11046) Pass schema from R to JVM using JSON format

2015-12-07 Thread Nakul Jindal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046004#comment-15046004
 ] 

Nakul Jindal commented on SPARK-11046:
--

[~shivaram], [~sunrui] - Is it ok to depend on / import the 
[jsonlite|https://cran.r-project.org/web/packages/jsonlite/index.html] package?





[jira] [Commented] (SPARK-11046) Pass schema from R to JVM using JSON format

2015-12-07 Thread Nakul Jindal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046269#comment-15046269
 ] 

Nakul Jindal commented on SPARK-11046:
--

I am trying to understand the benefit of doing it using JSON as opposed to the 
format that it currently is in.

We have 3 cases:


Case 1 - Leave things the way they are.
Here is what we have currently:
Let us say our type is:

array<map<string,struct<a:int,b:bigint,c:string>>>

- The R function structField.character (in schema.R) is passed this exact string
- In turn it calls checkType to recursively validate the schema string
- The scala function SQLUtils.getSQLDataType (in SQLUtils.scala), recursively 
converts this to an object of type DataType
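As a simplified model of that recursion (not the actual SparkR/Scala code, just a Python sketch of the idea), validating such a type string amounts to a small recursive-descent check:

```python
# Simplified model of checkType / SQLUtils.getSQLDataType: recursively
# validate a type string such as
# "array<map<string,struct<a:int,b:bigint,c:string>>>".
PRIMITIVES = {"string", "int", "integer", "bigint", "long",
              "double", "float", "boolean", "binary", "date", "timestamp"}

def check_type(s):
    if s in PRIMITIVES:
        return True
    if s.startswith("array<") and s.endswith(">"):
        return check_type(s[len("array<"):-1])
    if s.startswith("map<") and s.endswith(">"):
        # Map keys are primitive in this format, so split on the first comma.
        key, _, value = s[len("map<"):-1].partition(",")
        return key in PRIMITIVES and check_type(value)
    if s.startswith("struct<") and s.endswith(">"):
        return all(":" in f and check_type(f.split(":", 1)[1])
                   for f in split_top(s[len("struct<"):-1]))
    return False

def split_top(s):
    """Split on commas that are not nested inside <...>."""
    parts, depth, cur = [], 0, ""
    for ch in s:
        if ch == "<":
            depth += 1
        if ch == ">":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(cur)
            cur = ""
        else:
            cur += ch
    parts.append(cur)
    return parts
```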

Case 2 - Expect the user to specify the input schema in JSON
If we converted the schema format to JSON, it would look like this:
{
  "type": "array",
  "elementType": {
"type": "map",
"keyType": "string",
"valueType": {
  "type": "struct",
  "fields": [{
"name": "a",
"type": "integer",
"nullable": true,
"metadata": {}
  }, {
"name": "b",
"type": "long",
"nullable": true,
"metadata": {}
  }, {
"name": "c",
"type": "string",
"nullable": true,
"metadata": {}
  }]
},
"valueContainsNull": false
  },
  "containsNull": true
}
(based on what DataType.fromJson expects). This places far too much of a 
burden on the SparkR user.

- I am not entirely sure about this, but I think we do not have (or simply 
haven't implemented) a way to communicate exceptions encountered in the Scala 
code back to R.
- We'd need to write a way to validate the JSON schema in R code (or use a JSON 
parsing library to do it in some way).
- The code in SQLUtils.getSQLDataType would shrink significantly, as we could 
just call DataType.fromJson.
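To make that trade-off concrete, here is a small Python sketch (an assumption about shape, not Spark code) of the point above: once the schema is JSON, a stock JSON parser does the structural work, and validation reduces to checking the keys DataType.fromJson expects:

```python
import json

# Sketch: with a JSON schema, parsing is handled by the JSON library and
# validation is a simple walk over the keys DataType.fromJson would expect.
schema_json = """
{"type": "array",
 "elementType": {"type": "map", "keyType": "string",
                 "valueType": {"type": "struct", "fields": [
                     {"name": "a", "type": "integer",
                      "nullable": true, "metadata": {}}]},
                 "valueContainsNull": false},
 "containsNull": true}
"""

def validate(node):
    if isinstance(node, str):          # primitive type name
        return True
    t = node.get("type")
    if t == "array":
        return validate(node["elementType"])
    if t == "map":
        return validate(node["keyType"]) and validate(node["valueType"])
    if t == "struct":
        return all(validate(f["type"]) for f in node["fields"])
    return False
```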

Case 3 - Convert the schema to JSON in R code before calling the JVM function 
org.apache.spark.sql.api.r.SQLUtils.createStructField
- This essentially moves the work done in SQLUtils.getSQLDataType into R code, 
which IMHO is significantly more complicated to write and maintain.
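For comparison, the core of that R-side conversion might look like the following Python sketch (struct fields are omitted for brevity; this is an assumption about the shape of the work, not the proposed implementation):

```python
# Rough sketch of Case 3: translate the simple type string into the JSON
# shape DataType.fromJson expects, so the JVM side only parses JSON.
# Struct handling is omitted to keep the sketch short.
def to_json_type(s):
    if s.startswith("array<") and s.endswith(">"):
        return {"type": "array",
                "elementType": to_json_type(s[len("array<"):-1]),
                "containsNull": True}
    if s.startswith("map<") and s.endswith(">"):
        # Map keys are primitive in this format, so split on the first comma.
        key, _, value = s[len("map<"):-1].partition(",")
        return {"type": "map", "keyType": key,
                "valueType": to_json_type(value),
                "valueContainsNull": True}
    return s  # primitive type names pass through unchanged

converted = to_json_type("array<map<string,integer>>")
```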

TL;DR: At the cost of some inconvenience to the SparkR user, we would change 
the schema specification from its current (IMHO simple) form to JSON.

[~shivaram], [~sunrui] - Any thoughts?





[jira] [Commented] (SPARK-11046) Pass schema from R to JVM using JSON format

2015-12-04 Thread Nakul Jindal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042513#comment-15042513
 ] 

Nakul Jindal commented on SPARK-11046:
--

Hi, I am trying to look into this. 
When you say that SparkR passes a DataFrame schema from R to the JVM backend 
using a regular expression, do you mean type strings of this form:

map<keyType,valueType>
or
array<elementType>

Also, is "structField.character" the only function where this "regular 
expression" format is passed from R to the JVM (via 
"org.apache.spark.sql.api.r.SQLUtils", "createDF")?
