[ https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613459#comment-15613459 ]

Sammie Durugo commented on SPARK-11046:
---------------------------------------

I'm not sure anyone has noticed that a nested schema cannot be passed to dapply, 
and that an array type cannot be declared the way "integer", "string", and 
"double" can when defining a schema with structType. It would be useful to be 
able to declare an array type when using dapply, since most R outputs take the 
form of an R list object. For example, suppose the R output takes the following 
form:

output = list(bd = array(..., dim = c(d1, d2, d3)),
              dd = matrix(..., nr, nc),
              cp = list(a = matrix(..., nr, nc),
                        b = vector(...)))

In order to define a schema to pass to dapply in the above context, one should 
have the liberty to define the schema in the following form (if possible):

schema = structType(structField("bd", "array"), 
                                   structField("dd", "array"), 
                                   structField("cp", 
structType(structField("a", "array"), 
                                                                               
structField("b", "double") ) ) ),

The resulting schema might then look like this (if possible):

StructType
|-name = "bd", type = "ArrayType", nullable = TRUE
|-name = "dd", type = "ArrayType", nullable = TRUE
|-name = "cp", type = "ArrayType", nullable = TRUE
               |-name = "a", type = "ArrayType", nullable = TRUE
               |-name = "b", type = "double", nullable = TRUE

At the moment, only a character string is allowed for the data type parameter 
of structField. By relaxing this condition and allowing a structType to be 
passed inside an existing structType, the structure above could be accommodated 
easily. It would also help to allow R list objects, which by design are very 
close to R array objects, to be mapped to Spark's ArrayType.
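
For illustration only, here is a sketch of how such a nested schema might be 
written down if structField accepted complex type strings such as 
"array<double>" and "struct<...>" alongside the atomic ones; this syntax is an 
assumption about a possible API, not something the release discussed here is 
known to accept:

# Hypothetical sketch only: assumes structField would accept complex type
# strings such as "array<double>" and "struct<...>"; the SparkR release
# discussed here does not.
schema <- structType(
  structField("bd", "array<double>"),                    # 3-d array flattened into an array column
  structField("dd", "array<double>"),                    # matrix flattened into an array column
  structField("cp", "struct<a:array<double>,b:double>")  # nested fields a and b
)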

Having to fall back on the default setting of schema = NULL in dapply, which 
leaves the output as bytes, should be the very last resort. Thank you for your 
help with this.
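
For contrast, a minimal sketch of what does work with dapply today, using only 
atomic column types; here 'df' is an assumed existing SparkDataFrame with a 
numeric column "x":

# Minimal sketch with atomic types only; 'df' is an assumed existing
# SparkDataFrame with a numeric column "x".
schema <- structType(structField("x", "double"),
                     structField("x2", "double"))
df2 <- dapply(df, function(part) {
  part$x2 <- part$x^2   # plain data.frame in, plain data.frame out
  part
}, schema)
head(df2)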

Sammie.

> Pass schema from R to JVM using JSON format
> -------------------------------------------
>
>                 Key: SPARK-11046
>                 URL: https://issues.apache.org/jira/browse/SPARK-11046
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>    Affects Versions: 1.5.1
>            Reporter: Sun Rui
>            Priority: Minor
>
> Currently, SparkR passes a DataFrame schema from R to the JVM backend using a 
> regular expression. However, Spark now supports schemas in JSON format, so 
> SparkR should be enhanced to use the JSON format for the schema.
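
As a reference for the improvement described above, this is roughly the JSON 
form a struct schema takes in Spark (illustrative layout; on the JVM side 
org.apache.spark.sql.types.DataType.fromJson parses strings of this shape):

# Illustrative JSON representation of a schema with one array column and
# one double column, of the kind SparkR could hand to the JVM backend.
schema_json <- '{
  "type": "struct",
  "fields": [
    {"name": "bd",
     "type": {"type": "array", "elementType": "double", "containsNull": true},
     "nullable": true, "metadata": {}},
    {"name": "b", "type": "double", "nullable": true, "metadata": {}}
  ]
}'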


