I'm loading sequence files containing json blobs in the value, transforming
them into RDD[String] and then using hiveContext.jsonRDD(). It looks like
Spark reads the files twice: once when I define the jsonRDD() and then
again when I actually make my call to hiveContext.sql().
Looking @ the
There are a few things you can do here:
- Infer the schema on a subset of the data, then pass that inferred schema
(schemaRDD.schema) as the second argument of jsonRDD.
- Hand-construct a schema that includes only the fields you are interested
in and pass it as the second argument.
- Instead load the data
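
The first option might look roughly like the sketch below. It assumes the
Spark 1.1/1.2 Scala API, where jsonRDD returns a SchemaRDD and has an overload
taking a schema as its second argument. The object name, the helper name
loadWithSampledSchema, and the default sampling fraction are made up for
illustration. Note that sample() still scans the input, so this mainly saves
the JSON parsing done during inference, and any field that never appears in
the sample will be missing from the schema.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SchemaRDD
import org.apache.spark.sql.hive.HiveContext

object JsonSchemaOnce {
  // Hypothetical helper: infer the schema from a sample, then apply it to
  // the full RDD so no inference pass runs over the complete data set.
  def loadWithSampledSchema(
      hiveContext: HiveContext,
      json: RDD[String],
      fraction: Double = 0.01): SchemaRDD = {
    // Schema inference only parses the sampled records.
    val sampled = json.sample(withReplacement = false, fraction = fraction)
    val inferred = hiveContext.jsonRDD(sampled).schema

    // With an explicit schema, jsonRDD does not need to infer one.
    hiveContext.jsonRDD(json, inferred)
  }
}
```

The result can then be registered and queried as before, for example
JsonSchemaOnce.loadWithSampledSchema(hiveContext, jsonStrings).registerTempTable("events"),
where jsonStrings and the table name "events" are placeholders.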