Re: How to convert an RDD to a DF?

2015-12-20 Thread Eran Witkon
Got it to work, thanks
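
For anyone hitting this later: the step that was missing is mapping the
tuples into Rows, per Ted's pointer below. A minimal sketch (assumes the
jsonGzip RDD and the schema from the quoted messages):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val schema = StructType(
  StructField("cty", StringType, false) ::
  StructField("hse", StringType, false) ::
  StructField("nm",  StringType, false) ::
  StructField("yrs", StringType, false) :: Nil)

// createDataFrame(rowRDD, schema) wants an RDD[Row], not an RDD of
// tuples, so convert each 4-tuple into a Row first.
val rowRDD = jsonGzip.map { case (cty, hse, nm, yrs) => Row(cty, hse, nm, yrs) }
val unzipJSON = sqlContext.createDataFrame(rowRDD, schema)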
On Sun, 20 Dec 2015 at 17:01 Eran Witkon  wrote:

> I might be missing your point, but I don't get it.
> My understanding is that I need an RDD containing Rows, but how do I get it?
>
> I started with a DataFrame, ran a map on it, and got an
> RDD[(String, String, String, String)]. Now I want to convert it back to a
> DataFrame, and it's failing.
>
> Why?
>
>
> On Sun, Dec 20, 2015 at 4:49 PM Ted Yu  wrote:
>
>> See the comment for createDataFrame(rowRDD: RDD[Row], schema: StructType)
>> method:
>>
>>    * Creates a [[DataFrame]] from an [[RDD]] containing [[Row]]s using
>>    * the given schema. It is important to make sure that the structure of
>>    * every [[Row]] of the provided RDD matches the provided schema.
>>    * Otherwise, there will be a runtime exception.
>>* Example:
>>* {{{
>>*  import org.apache.spark.sql._
>>*  import org.apache.spark.sql.types._
>>*  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>*
>>*  val schema =
>>*StructType(
>>*  StructField("name", StringType, false) ::
>>*  StructField("age", IntegerType, true) :: Nil)
>>*
>>*  val people =
>>*sc.textFile("examples/src/main/resources/people.txt").map(
>>*  _.split(",")).map(p => Row(p(0), p(1).trim.toInt))
>>*  val dataFrame = sqlContext.createDataFrame(people, schema)
>>*  dataFrame.printSchema
>>*  // root
>>*  // |-- name: string (nullable = false)
>>*  // |-- age: integer (nullable = true)
>>
>> Cheers
>>
>> On Sun, Dec 20, 2015 at 6:31 AM, Eran Witkon  wrote:
>>
>>> Hi,
>>>
>>> I have an RDD
>>> jsonGzip
>>> res3: org.apache.spark.rdd.RDD[(String, String, String, String)] =
>>> MapPartitionsRDD[8] at map at <console>:65
>>>
>>> which I want to convert to a DataFrame with a schema,
>>> so I created one:
>>>
>>> val schema =
>>>   StructType(
>>>     StructField("cty", StringType, false) ::
>>>     StructField("hse", StringType, false) ::
>>>     StructField("nm",  StringType, false) ::
>>>     StructField("yrs", StringType, false) :: Nil)
>>>
>>> and called
>>>
>>> val unzipJSON = sqlContext.createDataFrame(jsonGzip,schema)
>>> <console>:36: error: overloaded method value createDataFrame with 
>>> alternatives:
>>>   (rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: 
>>> Class[_])org.apache.spark.sql.DataFrame 
>>>   (rdd: org.apache.spark.rdd.RDD[_],beanClass: 
>>> Class[_])org.apache.spark.sql.DataFrame 
>>>   (rowRDD: 
>>> org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: 
>>> org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame 
>>>   (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: 
>>> org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
>>>  cannot be applied to (org.apache.spark.rdd.RDD[(String, String, String, 
>>> String)], org.apache.spark.sql.types.StructType)
>>>val unzipJSON = sqlContext.createDataFrame(jsonGzip,schema)
>>>
>>>
>>> But as you can see, I don't have the right RDD type.
>>>
>>> So how can I get a DataFrame with the right column names?
>>>
>>>
>>>
>>


Re: How to convert an RDD to a DF?

2015-12-20 Thread Eran Witkon
I might be missing your point, but I don't get it.
My understanding is that I need an RDD containing Rows, but how do I get it?

I started with a DataFrame, ran a map on it, and got an
RDD[(String, String, String, String)]. Now I want to convert it back to a
DataFrame, and it's failing.

Why?


On Sun, Dec 20, 2015 at 4:49 PM Ted Yu  wrote:

> See the comment for createDataFrame(rowRDD: RDD[Row], schema: StructType)
> method:
>
>    * Creates a [[DataFrame]] from an [[RDD]] containing [[Row]]s using
>    * the given schema. It is important to make sure that the structure of
>    * every [[Row]] of the provided RDD matches the provided schema.
>    * Otherwise, there will be a runtime exception.
>* Example:
>* {{{
>*  import org.apache.spark.sql._
>*  import org.apache.spark.sql.types._
>*  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>*
>*  val schema =
>*StructType(
>*  StructField("name", StringType, false) ::
>*  StructField("age", IntegerType, true) :: Nil)
>*
>*  val people =
>*sc.textFile("examples/src/main/resources/people.txt").map(
>*  _.split(",")).map(p => Row(p(0), p(1).trim.toInt))
>*  val dataFrame = sqlContext.createDataFrame(people, schema)
>*  dataFrame.printSchema
>*  // root
>*  // |-- name: string (nullable = false)
>*  // |-- age: integer (nullable = true)
>
> Cheers
>
> On Sun, Dec 20, 2015 at 6:31 AM, Eran Witkon  wrote:
>
>> Hi,
>>
>> I have an RDD
>> jsonGzip
>> res3: org.apache.spark.rdd.RDD[(String, String, String, String)] =
>> MapPartitionsRDD[8] at map at <console>:65
>>
>> which I want to convert to a DataFrame with a schema,
>> so I created one:
>>
>> val schema =
>>   StructType(
>>     StructField("cty", StringType, false) ::
>>     StructField("hse", StringType, false) ::
>>     StructField("nm",  StringType, false) ::
>>     StructField("yrs", StringType, false) :: Nil)
>>
>> and called
>>
>> val unzipJSON = sqlContext.createDataFrame(jsonGzip,schema)
>> <console>:36: error: overloaded method value createDataFrame with 
>> alternatives:
>>   (rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: 
>> Class[_])org.apache.spark.sql.DataFrame 
>>   (rdd: org.apache.spark.rdd.RDD[_],beanClass: 
>> Class[_])org.apache.spark.sql.DataFrame 
>>   (rowRDD: 
>> org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: 
>> org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame 
>>   (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: 
>> org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
>>  cannot be applied to (org.apache.spark.rdd.RDD[(String, String, String, 
>> String)], org.apache.spark.sql.types.StructType)
>>val unzipJSON = sqlContext.createDataFrame(jsonGzip,schema)
>>
>>
>> But as you can see, I don't have the right RDD type.
>>
>> So how can I get a DataFrame with the right column names?
>>
>>
>>
>


Re: How to convert an RDD to a DF?

2015-12-20 Thread Ted Yu
See the comment for createDataFrame(rowRDD: RDD[Row], schema: StructType)
method:

   * Creates a [[DataFrame]] from an [[RDD]] containing [[Row]]s using
   * the given schema. It is important to make sure that the structure of
   * every [[Row]] of the provided RDD matches the provided schema.
   * Otherwise, there will be a runtime exception.
   * Example:
   * {{{
   *  import org.apache.spark.sql._
   *  import org.apache.spark.sql.types._
   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
   *
   *  val schema =
   *StructType(
   *  StructField("name", StringType, false) ::
   *  StructField("age", IntegerType, true) :: Nil)
   *
   *  val people =
   *sc.textFile("examples/src/main/resources/people.txt").map(
   *  _.split(",")).map(p => Row(p(0), p(1).trim.toInt))
   *  val dataFrame = sqlContext.createDataFrame(people, schema)
   *  dataFrame.printSchema
   *  // root
   *  // |-- name: string (nullable = false)
   *  // |-- age: integer (nullable = true)
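
If you'd rather not build the schema by hand, Spark 1.3+ can also infer it
by reflection from a case class. A sketch of that route (the House class
name is just an illustration, not from your code; assumes your jsonGzip
RDD is in scope):

// A case class supplies both the column names and the types;
// createDataFrame derives the StructType from it by reflection.
case class House(cty: String, hse: String, nm: String, yrs: String)

val df = sqlContext.createDataFrame(
  jsonGzip.map { case (cty, hse, nm, yrs) => House(cty, hse, nm, yrs) })
df.printSchema()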

Cheers

On Sun, Dec 20, 2015 at 6:31 AM, Eran Witkon  wrote:

> Hi,
>
> I have an RDD
> jsonGzip
> res3: org.apache.spark.rdd.RDD[(String, String, String, String)] =
> MapPartitionsRDD[8] at map at <console>:65
>
> which I want to convert to a DataFrame with a schema,
> so I created one:
>
> val schema =
>   StructType(
>     StructField("cty", StringType, false) ::
>     StructField("hse", StringType, false) ::
>     StructField("nm",  StringType, false) ::
>     StructField("yrs", StringType, false) :: Nil)
>
> and called
>
> val unzipJSON = sqlContext.createDataFrame(jsonGzip,schema)
> <console>:36: error: overloaded method value createDataFrame with 
> alternatives:
>   (rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: 
> Class[_])org.apache.spark.sql.DataFrame 
>   (rdd: org.apache.spark.rdd.RDD[_],beanClass: 
> Class[_])org.apache.spark.sql.DataFrame 
>   (rowRDD: 
> org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: 
> org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame 
>   (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: 
> org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
>  cannot be applied to (org.apache.spark.rdd.RDD[(String, String, String, 
> String)], org.apache.spark.sql.types.StructType)
>val unzipJSON = sqlContext.createDataFrame(jsonGzip,schema)
>
>
> But as you can see, I don't have the right RDD type.
>
> So how can I get a DataFrame with the right column names?
>
>
>
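
For completeness: a tuple RDD also converts directly once the SQLContext
implicits are in scope, and toDF takes the column names, so neither the
explicit schema nor the Row mapping is needed. A sketch (assuming Spark
1.3+; note the columns come out nullable, unlike the nullable = false
fields in your schema):

import sqlContext.implicits._

// toDF on an RDD of tuples accepts the column names directly;
// the types are inferred from the tuple's element types.
val unzipJSON = jsonGzip.toDF("cty", "hse", "nm", "yrs")
unzipJSON.printSchema()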