jsonRdd and MapType

2014-11-07 Thread boclair
I'm loading JSON into Spark to create a SchemaRDD (sqlContext.jsonRDD(..)).
I'd like some of the JSON fields to be a MapType rather than a nested
StructType, as the keys will be very sparse.

For example:
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 val jsonRdd = sc.parallelize(Seq(
   """{"key": "1234", "attributes": {"gender": "m"}}""",
   """{"key": "4321", "attributes": {"location": "nyc"}}"""))
 val schemaRdd = sqlContext.jsonRDD(jsonRdd)
 schemaRdd.printSchema
root
 |-- attributes: struct (nullable = true)
 |    |-- gender: string (nullable = true)
 |    |-- location: string (nullable = true)
 |-- key: string (nullable = true)
 schemaRdd.collect
res1: Array[org.apache.spark.sql.Row] = Array([[m,null],1234],
[[null,nyc],4321])


However this isn't what I want.  So I created my own StructType to pass to
the jsonRDD call:

 import org.apache.spark.sql._
 val st = StructType(Seq(
   StructField("key", StringType, false),
   StructField("attributes", MapType(StringType, StringType, false), true)))
 val jsonRddSt = sc.parallelize(Seq(
   """{"key": "1234", "attributes": {"gender": "m"}}""",
   """{"key": "4321", "attributes": {"location": "nyc"}}"""))
 val schemaRddSt = sqlContext.jsonRDD(jsonRddSt, st)
 schemaRddSt.printSchema
root
 |-- key: string (nullable = false)
 |-- attributes: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = false)
 schemaRddSt.collect
***  Failure  ***
scala.MatchError: MapType(StringType,StringType,false) (of class
org.apache.spark.sql.catalyst.types.MapType)
	at org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:397)
	...

The schema of the schemaRDD is correct.  But it seems that the JSON cannot
be coerced to a MapType.  At the line in the stack trace I can see that
enforceCorrectType has no case for MapType.  Is there something I'm missing?
Is this a bug, or a decision not to support MapType with JSON?

Thanks,
Brian




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/jsonRdd-and-MapType-tp18376.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: jsonRdd and MapType

2014-11-07 Thread Yin Huai
Hello Brian,

Right now, MapType is not supported in the StructType provided to
jsonRDD/jsonFile. We will add the support. I have created
https://issues.apache.org/jira/browse/SPARK-4302 to track this issue.
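
In the meantime, one possible workaround is to let jsonRDD infer the struct
schema as in your first example, then collapse the sparse struct into a Scala
Map by hand, dropping null fields. This is only a rough, untested sketch
against the 1.1-era API; the field names and column positions below are
assumptions taken from your sample data, not something the API provides:

```scala
import org.apache.spark.sql._

// Sketch: infer the schema as a struct, then fold the sparse "attributes"
// struct into a Map, keeping only the non-null fields of each row.
// attrNames and the column indices are assumptions based on the example JSON.
val inferred = sqlContext.jsonRDD(jsonRdd)   // attributes: struct, key: string
val attrNames = Seq("gender", "location")    // keys observed across the dataset
val keyed = inferred.map { row =>
  val attrs = row(0).asInstanceOf[Row]       // the attributes struct
  val attrMap = attrNames.zipWithIndex.collect {
    case (name, i) if !attrs.isNullAt(i) => name -> attrs.getString(i)
  }.toMap
  (row.getString(1), attrMap)                // (key, sparse attribute map)
}
```

Once SPARK-4302 is in, passing a StructType containing the MapType directly
to jsonRDD should make this unnecessary.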

Thanks,

Yin

On Fri, Nov 7, 2014 at 3:41 PM, boclair bocl...@gmail.com wrote:
