There are a few things you can do here:

 - Infer the schema on a subset of the data and pass that inferred schema
(schemaRDD.schema) as the second argument to jsonRDD.
 - Hand-construct a schema containing only the fields you are interested
in and pass it as the second argument.
 - Alternatively, load the data as a table with a single string column and
use Hive UDFs to extract the fields you want.

Michael

On Wed, Nov 12, 2014 at 2:05 PM, Corey Nolet <cjno...@gmail.com> wrote:

> I'm loading sequence files containing JSON blobs in the value,
> transforming them into RDD[String], and then using hiveContext.jsonRDD().
> It looks like Spark reads the files twice: once when I define the
> jsonRDD() and again when I actually make my call to hiveContext.sql().
>
> Looking at the code, I see an inferSchema() method that gets called
> under the hood. I also see an experimental jsonRDD() method that takes a
> sampleRatio.
>
> My dataset is extremely large and I've got a lot of processing to do on
> it, so looping through it twice is a luxury I can't afford. I also know
> that the SQL I am going to run matches at least "some" of the records
> contained in the files. Would it make sense, or be possible with the
> current execution plan design, to bypass schema inference for the sake
> of speed?
>
> Though I haven't dug any further into the code than the implementations
> of the client API methods I'm calling, I am wondering if there's a way
> to process the data without pre-determining the schema. I also don't
> have the luxury of giving the full schema ahead of time, because I may
> want to do a "select * from table" while only knowing 2 or 3 of the JSON
> keys that are actually available.
>
> Thanks.
>
