There are a few things you can do here:

- Infer the schema on a subset of the data, then pass that inferred schema (schemaRDD.schema) as the second argument of jsonRDD when loading the full dataset.
- Hand-construct a schema containing only the fields you are interested in and pass it as the second argument.
- Load the data as a table with a single string column and use Hive UDFs (e.g. get_json_object) to extract the fields you want.
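Roughly, the three options might look like this in a Spark shell (a sketch against Spark 1.1/1.2-era APIs; the input path, key names `key1`/`key2`, and the `Blob` case class are assumptions for illustration, and import locations for the schema types changed in later releases):

```scala
import org.apache.spark.SparkContext._          // implicits for sequenceFile
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Hypothetical input: sequence files whose values are JSON blobs.
val jsonStrings = sc.sequenceFile[String, String]("/path/to/data").values

// Option 1: infer the schema on a small sample, then reuse it for the full load.
val sample = hiveContext.jsonRDD(jsonStrings.sample(withReplacement = false, 0.01))
val full   = hiveContext.jsonRDD(jsonStrings, sample.schema)

// Option 2: hand-construct a schema with just the fields you care about.
// (In Spark 1.1/1.2 these types are exposed under org.apache.spark.sql;
// in 1.3+ they moved to org.apache.spark.sql.types.)
import org.apache.spark.sql.{StructType, StructField, StringType}
val schema = StructType(Seq(
  StructField("key1", StringType, nullable = true),
  StructField("key2", StringType, nullable = true)))
val projected = hiveContext.jsonRDD(jsonStrings, schema)

// Option 3: keep each blob as a single string column and extract fields
// lazily with a Hive UDF, so no schema inference pass is needed at all.
case class Blob(json: String)
import hiveContext.createSchemaRDD              // implicit RDD -> SchemaRDD
jsonStrings.map(Blob).registerTempTable("raw_json")
hiveContext.sql("SELECT get_json_object(json, '$.key1') FROM raw_json")
```

With options 1 and 2, jsonRDD skips the inference pass entirely when a schema is supplied, so the data is only scanned once by the actual query; option 3 trades that for per-row JSON parsing inside the UDF.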
Michael

On Wed, Nov 12, 2014 at 2:05 PM, Corey Nolet <cjno...@gmail.com> wrote:
> I'm loading sequence files containing json blobs in the value,
> transforming them into RDD[String] and then using hiveContext.jsonRDD(). It
> looks like Spark reads the files twice: once when I define the jsonRDD()
> and then again when I actually make my call to hiveContext.sql().
>
> Looking at the code, I see an inferSchema() method which gets called under
> the hood. I also see an experimental jsonRDD() method which takes a
> sampleRatio.
>
> My dataset is extremely large and I've got a lot of processing to do on
> it; being able to loop through it twice is really not a luxury I have. I also
> know that the SQL I am going to be running matches at least "some" of the
> records contained in the files. Would it make sense, or be possible with the
> current execution plan design, to bypass inferring the schema for the sake
> of speed?
>
> Though I haven't dug further into the code than the implementations
> of the client API methods I'm calling, I am wondering if there's a way
> to process the data without pre-determining the schema. I also don't have
> the luxury of giving the full schema ahead of time, because I may want to
> do a "select * from table" while only knowing 2 or 3 of the actual json
> keys that are available.
>
> Thanks.