Re: PySpark + Streaming + DataFrames

2015-11-02 Thread Jason White
This should be resolved with
https://github.com/apache/spark/commit/f92f334ca47c03b980b06cf300aa652d0ffa1880.
The conversion no longer does a `.take` when converting from RDD -> DF.
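
[A minimal sketch of the conversion in question — the schema and sample data
are illustrative, not code from this thread. On builds that include the commit
above, passing an explicit schema means createDataFrame no longer samples the
RDD with `.take`:]

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    sc = SparkContext(appName="rdd-to-df-sketch")
    sqlContext = SQLContext(sc)

    # An explicit schema: nothing to infer, so no need to sample the RDD.
    schema = StructType([
        StructField("word", StringType()),
        StructField("count", LongType()),
    ])

    rdd = sc.parallelize([("spark", 3), ("streaming", 2)])
    df = sqlContext.createDataFrame(rdd, schema)  # no `.take` on patched builds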


Re: PySpark + Streaming + DataFrames

2015-10-19 Thread Tathagata Das
Yes, precisely! Also, for other folks who may read this, could you reply back
with the trusted conversion that worked for you (so there's a clear solution on
the thread)?

TD


Re: PySpark + Streaming + DataFrames

2015-10-19 Thread Jason White
Ah, that makes sense then, thanks TD.

The conversion from RDD -> DF involves a `.take(10)` in PySpark, even if you 
provide the schema, so I was avoiding back-and-forth conversions. I’ll see if I 
can create a ‘trusted’ conversion that doesn’t involve the `take`.

-- 
Jason
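
[To make the concern concrete — a rough sketch, mine rather than Jason's code,
reusing the sqlContext and schema from the snippet under the 2015-11-02 message
above and assuming some DStream `dstream`. On pre-fix builds, every batch's
createDataFrame call runs a `.take(10)` against the batch RDD, an extra job per
micro-batch, even with a full schema supplied:]

    def process_batch(rdd):
        # RDD -> DF: on pre-fix PySpark this runs a .take(10) against the
        # batch RDD every interval, even though `schema` describes it fully.
        df = sqlContext.createDataFrame(rdd, schema)
        return df.select("word", "count").rdd  # DF -> RDD for the DStream

    transformed = dstream.transform(process_batch)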


Re: PySpark + Streaming + DataFrames

2015-10-19 Thread Tathagata Das
RDD and DF are not compatible data types, so you cannot return a DF where you
have to return an RDD. What you can do instead is return the underlying RDD of
the DataFrame via dataframe.rdd.
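
[A minimal self-contained sketch of this pattern — the socket source, schema,
and function names are illustrative, not code from the thread:]

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    sc = SparkContext(appName="dstream-df-sketch")
    ssc = StreamingContext(sc, 5)  # 5-second batches
    sqlContext = SQLContext(sc)

    schema = StructType([
        StructField("word", StringType()),
        StructField("count", LongType()),
    ])

    def with_dataframe(rdd):
        # Do the work with the DataFrame API, but hand an RDD back to
        # transform() -- returning the DF itself is what breaks.
        df = sqlContext.createDataFrame(rdd, schema)
        return df.filter(df["count"] > 1).rdd

    counts = (ssc.socketTextStream("localhost", 9999)
              .flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

    counts.transform(with_dataframe).pprint()

    ssc.start()
    ssc.awaitTermination()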


Re: PySpark + Streaming + DataFrames

2015-10-16 Thread Jason White
Hi Ken, thanks for replying.

Unless I'm misunderstanding something, I don't believe that's correct.
DStream.transform() accepts a single argument, func. func should be a
function that accepts a single RDD and returns a single RDD. That's what
transform_to_df does, except that what it returns is a DF rather than an RDD.

I've used DStream.transform() successfully in the past when transforming
RDDs, so I don't think my problem is there.

I haven't tried this in Scala yet, and all of the examples I've seen on the
website seem to use foreach instead of transform. Does this approach work in
Scala?
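
[Rough reconstruction — mine, not the original code — of the transform_to_df
pattern under discussion, again assuming the sqlContext and schema from the
sketch above. As TD notes above in the thread, the problem is that it hands
transform() a DataFrame where an RDD is expected:]

    def transform_to_df(rdd):
        # Accepts a single RDD, as transform() requires, but returns a
        # DataFrame -- and a DF is not an RDD, so downstream DStream code
        # breaks. Returning df.rdd instead, per TD's reply, is the fix.
        return sqlContext.createDataFrame(rdd, schema)

    dstream.transform(transform_to_df)  # fails downstream: a DF is not an RDD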