Spark SQL Lazy Schema Evaluation

2014-11-12 Thread Corey Nolet
I'm loading sequence files containing JSON blobs in the value, transforming them into RDD[String], and then using hiveContext.jsonRDD(). It looks like Spark reads the files twice: once when I define the jsonRDD() and then again when I actually make my call to hiveContext.sql(). Looking at the
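The pipeline being described might look roughly like the following sketch, written against the Spark 1.1-era Scala API (the path, key/value types, and table name are assumptions, not from the original message). Because jsonRDD() scans the data to infer a schema, the input ends up being read once for inference and again when the query runs:

```scala
// Hypothetical sketch of the pipeline above; paths and types are assumptions.
import org.apache.hadoop.io.Text
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext("local[*]", "json-schema-demo")
val hiveContext = new HiveContext(sc)

// Sequence files whose values hold JSON blobs, flattened to RDD[String].
val jsonStrings = sc.sequenceFile[Text, Text]("hdfs:///path/to/seqfiles")
  .map { case (_, value) => value.toString }

// Pass 1: jsonRDD() scans the data to infer a schema.
val schemaRDD = hiveContext.jsonRDD(jsonStrings)
schemaRDD.registerTempTable("events")

// Pass 2: the query itself re-reads the files.
hiveContext.sql("SELECT * FROM events").collect()
```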

Re: Spark SQL Lazy Schema Evaluation

2014-11-12 Thread Michael Armbrust
There are a few things you can do here:
- Infer the schema on a subset of the data, then pass that inferred schema (schemaRDD.schema) as the second argument of jsonRDD.
- Hand-construct a schema, including only the fields you are interested in, and pass it as the second argument.
- Instead load the data
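The first two suggestions might be sketched as follows, again against the Spark 1.1-era API (the variable names, sample fraction, and field names are illustrative assumptions). Both variants avoid the extra inference pass over the full dataset:

```scala
// Sketch of the first two options; names and fields are placeholders.
import org.apache.spark.sql.{StructField, StructType, StringType}

// Option 1: infer the schema from a small sample, then reuse it so the
// full dataset is only scanned by the query itself.
val sampledSchema =
  hiveContext.jsonRDD(jsonStrings.sample(false, 0.01)).schema
val events = hiveContext.jsonRDD(jsonStrings, sampledSchema)

// Option 2: hand-construct a schema containing only the fields of interest;
// jsonRDD then skips inference entirely.
val narrowSchema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("message", StringType, nullable = true)))
val narrowEvents = hiveContext.jsonRDD(jsonStrings, narrowSchema)
narrowEvents.registerTempTable("events")
```

Hand-constructing the schema has the added benefit that Spark only has to parse the named fields out of each JSON blob, rather than materializing every field it finds.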