Got it to work, thanks!

On Sun, 20 Dec 2015 at 17:01 Eran Witkon <eranwit...@gmail.com> wrote:
> I might be missing your point, but I don't get it.
> My understanding is that I need an RDD containing Rows, but how do I get it?
>
> I started with a DataFrame, ran a map on it, and got an
> RDD[(String, String, String, String)]. Now I want to convert it back to a
> DataFrame and it is failing...
>
> Why?
>
>
> On Sun, Dec 20, 2015 at 4:49 PM Ted Yu <yuzhih...@gmail.com> wrote:
>
>> See the comment for the createDataFrame(rowRDD: RDD[Row], schema: StructType)
>> method:
>>
>> * Creates a [[DataFrame]] from an [[RDD]] containing [[Row]]s using
>> the given schema.
>> * It is important to make sure that the structure of every [[Row]] of
>> the provided RDD matches
>> * the provided schema. Otherwise, there will be runtime exception.
>> * Example:
>> * {{{
>> *  import org.apache.spark.sql._
>> *  import org.apache.spark.sql.types._
>> *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>> *
>> *  val schema =
>> *    StructType(
>> *      StructField("name", StringType, false) ::
>> *      StructField("age", IntegerType, true) :: Nil)
>> *
>> *  val people =
>> *    sc.textFile("examples/src/main/resources/people.txt").map(
>> *      _.split(",")).map(p => Row(p(0), p(1).trim.toInt))
>> *  val dataFrame = sqlContext.createDataFrame(people, schema)
>> *  dataFrame.printSchema
>> *  // root
>> *  // |-- name: string (nullable = false)
>> *  // |-- age: integer (nullable = true)
>>
>> Cheers
>>
>> On Sun, Dec 20, 2015 at 6:31 AM, Eran Witkon <eranwit...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I have an RDD:
>>>
>>> jsonGzip
>>> res3: org.apache.spark.rdd.RDD[(String, String, String, String)] =
>>> MapPartitionsRDD[8] at map at <console>:65
>>>
>>> which I want to convert to a DataFrame with a schema,
>>> so I created one:
>>>
>>> val schema =
>>>   StructType(
>>>     StructField("cty", StringType, false) ::
>>>     StructField("hse", StringType, false) ::
>>>     StructField("nm", StringType, false) ::
>>>     StructField("yrs", StringType, false) :: Nil)
>>>
>>> and called
>>>
>>> val unzipJSON = sqlContext.createDataFrame(jsonGzip, schema)
>>> <console>:36: error: overloaded method value createDataFrame with
>>> alternatives:
>>>   (rdd: org.apache.spark.api.java.JavaRDD[_],beanClass:
>>> Class[_])org.apache.spark.sql.DataFrame <and>
>>>   (rdd: org.apache.spark.rdd.RDD[_],beanClass:
>>> Class[_])org.apache.spark.sql.DataFrame <and>
>>>   (rowRDD:
>>> org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema:
>>> org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
>>>   (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema:
>>> org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
>>> cannot be applied to (org.apache.spark.rdd.RDD[(String, String, String,
>>> String)], org.apache.spark.sql.types.StructType)
>>>        val unzipJSON = sqlContext.createDataFrame(jsonGzip, schema)
>>>
>>> But as you can see, I don't have the right RDD type.
>>>
>>> So how can I get a DataFrame with the right column names?
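For anyone hitting the same overload error: a minimal sketch of the kind of fix the thread converges on, assuming the jsonGzip RDD and the schema from the messages above (everything else is the standard Spark 1.x spark-shell API, where sc and sqlContext already exist). The key step is mapping each tuple to a Row so the RDD[Row] overload of createDataFrame applies:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Schema from the thread: four non-nullable string columns.
val schema =
  StructType(
    StructField("cty", StringType, false) ::
    StructField("hse", StringType, false) ::
    StructField("nm", StringType, false) ::
    StructField("yrs", StringType, false) :: Nil)

// jsonGzip: RDD[(String, String, String, String)], as shown in the thread.
// createDataFrame(rowRDD, schema) needs an RDD[Row], not an RDD of tuples,
// so convert each tuple to a Row first.
val rowRDD: RDD[Row] = jsonGzip.map { case (cty, hse, nm, yrs) =>
  Row(cty, hse, nm, yrs)
}

val unzipJSON = sqlContext.createDataFrame(rowRDD, schema)
unzipJSON.printSchema()

Alternatively, if you don't need to control nullability, a tuple RDD can be converted directly with the SQLContext implicits: import sqlContext.implicits._ and then jsonGzip.toDF("cty", "hse", "nm", "yrs"), at the cost of all columns coming out nullable.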