On Fri, Jun 17, 2016 at 12:44 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Are these mainly in csv format?

Alas, no -- lots of different formats. Many are fixed-width files, where I have outside information to know which byte ranges correspond to which columns. Some have odd null representations or non-comma delimiters (though many of those cases might fit within the configurability of the spark-csv package).

> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 17 June 2016 at 20:38, Everett Anderson <ever...@nuna.com.invalid> wrote:
>
>> Hi,
>>
>> I have a system with files in a variety of non-standard input formats, though they're generally flat text files. I'd like to dynamically create DataFrames of string columns.
>>
>> What's the best way to go from an RDD<String> to a DataFrame of StringType columns?
>>
>> My current plan is:
>>
>>   - Call map() on the RDD<String> with a function to split the String into columns and call RowFactory.create() with the resulting array, creating an RDD<Row>
>>   - Construct a StructType schema using column names and StringType
>>   - Call SQLContext.createDataFrame(RDD, schema) to create the result
>>
>> Does that make sense?
>>
>> I looked through the spark-csv package a little and noticed that it's using baseRelationToDataFrame(), but BaseRelation looks like it might be a restricted developer API. Anyone know if it's recommended for use?
>>
>> Thanks!
>>
>> - Everett
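For reference, the map()/StructType/createDataFrame plan in the quoted message can be sketched roughly as below. This is only a sketch, not a tested implementation: the `splitFixedWidth` helper, the column `ranges`, and the sample line are all made up for illustration, and the Spark 1.x Java wiring (which assumes names like `lines`, `columnNames`, and `sqlContext` exist in scope) is shown only in comments.

```java
import java.util.Arrays;

public class FixedWidthToRows {

    // Hypothetical helper: split one fixed-width line into trimmed column
    // values using [start, end) character ranges known from outside
    // information. Ranges are clamped so short lines don't throw.
    static String[] splitFixedWidth(String line, int[][] ranges) {
        String[] cols = new String[ranges.length];
        for (int i = 0; i < ranges.length; i++) {
            int start = Math.min(ranges[i][0], line.length());
            int end = Math.min(ranges[i][1], line.length());
            cols[i] = line.substring(start, end).trim();
        }
        return cols;
    }

    public static void main(String[] args) {
        // Example ranges: name [0,10), city [10,17), zip [17,22) -- invented.
        int[][] ranges = {{0, 10}, {10, 17}, {17, 22}};
        String[] cols = splitFixedWidth("Everett   Seattle98101", ranges);
        System.out.println(Arrays.toString(cols));
        // prints [Everett, Seattle, 98101]

        // With Spark on the classpath, the three steps from the mail would
        // look roughly like this (untested sketch, Spark 1.x Java API):
        //
        //   // 1. RDD<String> -> RDD<Row>
        //   JavaRDD<Row> rows = lines.map(
        //       l -> RowFactory.create((Object[]) splitFixedWidth(l, ranges)));
        //
        //   // 2. StructType schema of all-StringType, nullable columns
        //   List<StructField> fields = new ArrayList<>();
        //   for (String name : columnNames) {
        //       fields.add(DataTypes.createStructField(name, DataTypes.StringType, true));
        //   }
        //   StructType schema = DataTypes.createStructType(fields);
        //
        //   // 3. Assemble the DataFrame
        //   DataFrame df = sqlContext.createDataFrame(rows, schema);
    }
}
```

The clamping in the helper is one design choice among several; depending on the data, a too-short line might instead deserve a null column or a rejected record.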