We take the exact same approach you proposed for converting horrible text
formats (VCF, in the bioinformatics domain) into DataFrames. This involves
creating the schema dynamically from the header of the file, too.

It's simple and easy, but if you need something higher-performance you
might need to look into custom Dataset encoders, though I'm not sure what
kind of gain (if any) you'd get from that approach.

Jason

On Fri, Jun 17, 2016, 12:38 PM Everett Anderson <ever...@nuna.com.invalid>
wrote:

> Hi,
>
> I have a system with files in a variety of non-standard input formats,
> though they're generally flat text files. I'd like to dynamically create
> DataFrames of string columns.
>
> What's the best way to go from an RDD<String> to a DataFrame of StringType
> columns?
>
> My current plan is
>
>    - Call map() on the RDD<String> with a function to split the String
>    into columns and call RowFactory.create() with the resulting array,
>    creating an RDD<Row>
>    - Construct a StructType schema using column names and StringType
>    - Call SQLContext.createDataFrame(RDD, schema) to create the result
>
> Does that make sense?
>
> I looked through the spark-csv package a little and noticed that it's
> using baseRelationToDataFrame(), but BaseRelation looks like it might be a
> restricted developer API. Anyone know if it's recommended for use?
>
> Thanks!
>
> - Everett
>
>