Re: How to convert an RDD to a DF?
Got it to work, thanks!

On Sun, 20 Dec 2015 at 17:01 Eran Witkon wrote:
> I might be missing your point, but I don't get it.
> My understanding is that I need an RDD containing Rows, but how do I get it?
>
> I started with a DataFrame, ran a map on it, and got an
> RDD[(String, String, String, String)]. Now I want to convert it back to a
> DataFrame, and it is failing.
>
> Why?
>
> On Sun, Dec 20, 2015 at 4:49 PM Ted Yu wrote:
>
>> See the comment for the createDataFrame(rowRDD: RDD[Row], schema: StructType)
>> method:
>>
>>  * Creates a [[DataFrame]] from an [[RDD]] containing [[Row]]s using
>>    the given schema.
>>  * It is important to make sure that the structure of every [[Row]] of
>>    the provided RDD matches the provided schema. Otherwise, there will
>>    be a runtime exception.
>>  * Example:
>>  * {{{
>>  * import org.apache.spark.sql._
>>  * import org.apache.spark.sql.types._
>>  * val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>  *
>>  * val schema =
>>  *   StructType(
>>  *     StructField("name", StringType, false) ::
>>  *     StructField("age", IntegerType, true) :: Nil)
>>  *
>>  * val people =
>>  *   sc.textFile("examples/src/main/resources/people.txt").map(
>>  *     _.split(",")).map(p => Row(p(0), p(1).trim.toInt))
>>  * val dataFrame = sqlContext.createDataFrame(people, schema)
>>  * dataFrame.printSchema
>>  * // root
>>  * // |-- name: string (nullable = false)
>>  * // |-- age: integer (nullable = true)
>>  * }}}
>>
>> Cheers
>>
>> On Sun, Dec 20, 2015 at 6:31 AM, Eran Witkon wrote:
>>
>>> Hi,
>>>
>>> I have an RDD:
>>>
>>> jsonGzip
>>> res3: org.apache.spark.rdd.RDD[(String, String, String, String)] =
>>>   MapPartitionsRDD[8] at map at <console>:65
>>>
>>> which I want to convert to a DataFrame with a schema, so I created one:
>>>
>>> val schema =
>>>   StructType(
>>>     StructField("cty", StringType, false) ::
>>>     StructField("hse", StringType, false) ::
>>>     StructField("nm", StringType, false) ::
>>>     StructField("yrs", StringType, false) :: Nil)
>>>
>>> and called:
>>>
>>> val unzipJSON = sqlContext.createDataFrame(jsonGzip, schema)
>>> <console>:36: error: overloaded method value createDataFrame with alternatives:
>>>   (rdd: org.apache.spark.api.java.JavaRDD[_], beanClass: Class[_])org.apache.spark.sql.DataFrame
>>>   (rdd: org.apache.spark.rdd.RDD[_], beanClass: Class[_])org.apache.spark.sql.DataFrame
>>>   (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row], schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
>>>   (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row], schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
>>>  cannot be applied to (org.apache.spark.rdd.RDD[(String, String, String, String)], org.apache.spark.sql.types.StructType)
>>>        val unzipJSON = sqlContext.createDataFrame(jsonGzip, schema)
>>>
>>> But as you see, I don't have the right RDD type.
>>>
>>> So how can I get a DataFrame with the right column names?
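[Editor's note for the archives: the thread ends with "Got it to work" but never shows the fix. A minimal sketch of the likely solution, assuming the same `sqlContext` and the `jsonGzip: RDD[(String, String, String, String)]` from above, is to map each 4-tuple to a `Row` so the `createDataFrame(rowRDD: RDD[Row], schema: StructType)` overload applies. This requires a running Spark 1.x context and is not shown as a verified transcript.]

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumes sqlContext and jsonGzip: RDD[(String, String, String, String)]
// exist as in the thread above.
val schema =
  StructType(
    StructField("cty", StringType, false) ::
    StructField("hse", StringType, false) ::
    StructField("nm",  StringType, false) ::
    StructField("yrs", StringType, false) :: Nil)

// Option 1: convert each tuple to a Row, matching the schema field order,
// so the createDataFrame(rowRDD: RDD[Row], schema) overload is chosen.
val rowRDD: RDD[Row] =
  jsonGzip.map { case (cty, hse, nm, yrs) => Row(cty, hse, nm, yrs) }
val unzipJSON = sqlContext.createDataFrame(rowRDD, schema)

// Option 2: with the SQLContext implicits in scope, a tuple RDD can be
// converted directly, naming the columns inline (fields become nullable).
import sqlContext.implicits._
val unzipJSON2 = jsonGzip.toDF("cty", "hse", "nm", "yrs")
```

Option 1 keeps the explicit non-nullable schema from the original post; Option 2 is shorter but lets Spark infer the types and nullability from the tuple.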