[jira] [Commented] (SPARK-11046) Pass schema from R to JVM using JSON format
[ https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15812214#comment-15812214 ] Vicente Masip commented on SPARK-11046:
---
I'd like to see this last feature implemented; it is necessary for the reasons explained here. What happens when you are working with 50 columns and gapply? Do I have to rewrite all 50 columns plus the new column produced by the gapply operation? I think there is no alternative. Any suggestions?

> Pass schema from R to JVM using JSON format
> ---
> Key: SPARK-11046
> URL: https://issues.apache.org/jira/browse/SPARK-11046
> Project: Spark
> Issue Type: Improvement
> Components: SparkR
> Affects Versions: 1.5.1
> Reporter: Sun Rui
> Priority: Minor
>
> Currently, SparkR passes a DataFrame schema from R to the JVM backend using
> a regular expression. However, Spark now supports schemas in JSON format,
> so enhance SparkR to use schemas in JSON format.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11046) Pass schema from R to JVM using JSON format
[ https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613459#comment-15613459 ] Sammie Durugo commented on SPARK-11046:
---
I'm not sure anyone has noticed that a nested schema cannot be passed to dapply, and that an array type cannot be declared the way "integer", "string", and "double" can when defining a schema with structType. It would be useful to be able to declare an array type when using dapply, since most R outputs take the form of an R list object. For example, suppose the R output has the following form:

output = list(bd = array(..., dim = c(d1, d2, d3)),
              dd = matrix(..., nr, nc),
              cp = list(a = matrix(..., nr, nc),
                        b = vector(...)))

To define a schema to pass to dapply in this context, one should have the liberty to define it with the following form (if possible):

schema = structType(structField("bd", "array"),
                    structField("dd", "array"),
                    structField("cp", structType(structField("a", "array"),
                                                 structField("b", "double"))))

which might look like this (if possible):

StructType
|-name = "bd", type = "ArrayType", nullable = TRUE
|-name = "dd", type = "ArrayType", nullable = TRUE
|-name = "cp", type = "StructType", nullable = TRUE
  |-name = "a", type = "ArrayType", nullable = TRUE
  |-name = "b", type = "double", nullable = TRUE

At the moment, only a character type is allowed for the data type parameter of structField. But by relaxing this condition and allowing a structType to be passed inside an existing structType, the structure above could be accommodated easily. Also, R list objects, which are very close to R array objects by design, should be allowed to map into Spark's ArrayType. Having to use the default setting of schema = NULL in dapply, which leaves the output as bytes, should be the very last resort. Thank you for your help with this. Sammie.
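The nested schema requested above can already be written down in the JSON schema format this ticket proposes. A minimal sketch, in Python for illustration only (SparkR itself is R code): it builds the JSON representation of the structure Sammie describes, following Spark's documented DataType JSON conventions. The field names (bd, dd, cp, a, b) come from the comment; the element types are assumptions, since the comment elides them.

```python
import json

def array_of(element_type):
    # Spark's JSON encoding of ArrayType(element_type, containsNull = True).
    return {"type": "array", "elementType": element_type, "containsNull": True}

def field(name, dtype):
    # Spark's JSON encoding of a StructField with default nullability/metadata.
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}

# The nested schema from the comment; "double" element types are assumed.
schema = {
    "type": "struct",
    "fields": [
        field("bd", array_of("double")),
        field("dd", array_of("double")),
        field("cp", {
            "type": "struct",
            "fields": [
                field("a", array_of("double")),
                field("b", "double"),
            ],
        }),
    ],
}

# Serialized form, as it could be handed from R to the JVM backend.
schema_json = json.dumps(schema)
print(schema_json)
```

Nesting a struct inside a struct, or an array inside a field, is just composition of JSON objects here, which is what makes the JSON route attractive for exactly this use case.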
[jira] [Commented] (SPARK-11046) Pass schema from R to JVM using JSON format
[ https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046004#comment-15046004 ] Nakul Jindal commented on SPARK-11046:
---
[~shivaram], [~sunrui] - Is it ok to depend on / import the [jsonlite|https://cran.r-project.org/web/packages/jsonlite/index.html] package?
[jira] [Commented] (SPARK-11046) Pass schema from R to JVM using JSON format
[ https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046269#comment-15046269 ] Nakul Jindal commented on SPARK-11046:
---
I am trying to understand the benefit of using JSON as opposed to the format currently in use. We have three cases:

Case 1 - Leave things the way they are. Here is what we have currently. Say our type is array<map<string,struct<a:integer,b:long,c:string>>>:
- The R function structField.character (in schema.R) is passed this exact string.
- It in turn calls checkType to recursively validate the schema string.
- The Scala function SQLUtils.getSQLDataType (in SQLUtils.scala) recursively converts the string to an object of type DataType.

Case 2 - Expect the user to specify the input schema in JSON. The same schema in JSON format (based on what DataType.fromJson expects) would look like this:

{
  "type": "array",
  "elementType": {
    "type": "map",
    "keyType": "string",
    "valueType": {
      "type": "struct",
      "fields": [
        { "name": "a", "type": "integer", "nullable": true, "metadata": {} },
        { "name": "b", "type": "long", "nullable": true, "metadata": {} },
        { "name": "c", "type": "string", "nullable": true, "metadata": {} }
      ]
    },
    "valueContainsNull": false
  },
  "containsNull": true
}

which places far too much burden on the SparkR user.
- I am not entirely sure about this, but I think we either do not want, or simply do not have, a way to communicate exceptions encountered in the Scala code back to R.
- We would need to write a way to validate the JSON schema in R code (or use a JSON parsing library to do it in some way).
- The code in SQLUtils.getSQLDataType would be significantly reduced, as we could just call DataType.fromJson.

Case 3 - Convert the schema to JSON in R code before calling the JVM function org.apache.spark.sql.api.r.SQLUtils.createStructField.
- This essentially moves the work done in SQLUtils.getSQLDataType into R code, which IMHO is significantly more complicated to write and maintain.
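For concreteness, the Case 2 payload above can be checked mechanically. A small sketch, in Python for illustration only (Spark itself is not involved): it writes out the same schema as a literal, confirms it serializes and round-trips as well-formed JSON, and walks the nesting.

```python
import json

# The Case 2 schema from the comment, written as a Python literal so we can
# verify it is well-formed JSON of the shape DataType.fromJson expects.
case2 = {
    "type": "array",
    "elementType": {
        "type": "map",
        "keyType": "string",
        "valueType": {
            "type": "struct",
            "fields": [
                {"name": "a", "type": "integer", "nullable": True, "metadata": {}},
                {"name": "b", "type": "long", "nullable": True, "metadata": {}},
                {"name": "c", "type": "string", "nullable": True, "metadata": {}},
            ],
        },
        "valueContainsNull": False,
    },
    "containsNull": True,
}

encoded = json.dumps(case2)
decoded = json.loads(encoded)
assert decoded == case2  # the schema round-trips cleanly
print(len(decoded["elementType"]["valueType"]["fields"]))
```

The depth of nesting needed even for this one type is a fair illustration of the burden Case 2 would place on users writing schemas by hand.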
TL;DR: At the cost of inconvenience to the SparkR user, we would convert schema specification from its current (IMHO simple) form to JSON. [~shivaram], [~sunrui] - Any thoughts?
[jira] [Commented] (SPARK-11046) Pass schema from R to JVM using JSON format
[ https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042513#comment-15042513 ] Nakul Jindal commented on SPARK-11046:
---
Hi, I am trying to look into this. When you say that SparkR passes a DataFrame schema from R to the JVM backend using a regular expression, do you mean type strings of the form map<keyType,valueType> or array<elementType>? Also, is "structField.character" the only function where this "regular expression" format is passed from R to the JVM (using "org.apache.spark.sql.api.r.SQLUtils", "createDF")?
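To make the "regular expression" format concrete: earlier comments describe checkType (in schema.R) recursively validating type strings such as these. A rough sketch in Python of that kind of check; the patterns and the primitive-type list below are illustrative assumptions, not SparkR's actual expressions.

```python
import re

# Illustrative subset of Spark SQL primitive type names (an assumption,
# not the exhaustive list SparkR accepts).
PRIMITIVES = {"byte", "integer", "float", "double", "string", "boolean"}

def check_type(s: str) -> bool:
    """Recursively validate a type string like the ones checkType handles."""
    if s in PRIMITIVES:
        return True
    m = re.fullmatch(r"array<(.+)>", s)
    if m:
        # array<elementType>: validate the element type.
        return check_type(m.group(1))
    m = re.fullmatch(r"map<([^,<]+),(.+)>", s)
    if m:
        # map<keyType,valueType>: keys are primitive; validate both parts.
        return check_type(m.group(1)) and check_type(m.group(2))
    return False

print(check_type("map<string,integer>"))         # True
print(check_type("array<map<string,integer>>"))  # True
print(check_type("array<complex>"))              # False
```

This is the per-string validation that a JSON-based scheme would replace with a single DataType.fromJson call on the JVM side.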