Re: PySpark + Streaming + DataFrames
This should be resolved with https://github.com/apache/spark/commit/f92f334ca47c03b980b06cf300aa652d0ffa1880. The conversion no longer does a `.take` when converting from RDD -> DF. On Mon, Oct 19, 2015 at 6:30 PM, Tathagata Das wrote: > Yes, precisely! Also, for other folks who may read this, could reply back > with the trusted conversion that worked for you (for a clear solution)? > > TD > > > On Mon, Oct 19, 2015 at 3:08 PM, Jason White > wrote: > >> Ah, that makes sense then, thanks TD. >> >> The conversion from RDD -> DF involves a `.take(10)` in PySpark, even if >> you provide the schema, so I was avoiding back-and-forth conversions. I’ll >> see if I can create a ‘trusted’ conversion that doesn’t involve the `take`. >> >> -- >> Jason >> >> On October 19, 2015 at 5:23:59 PM, Tathagata Das (t...@databricks.com) >> wrote: >> >> RDD and DF are not compatible data types. So you cannot return a DF when >> you have to return an RDD. What rather you can do is return the underlying >> RDD of the dataframe by dataframe.rdd(). >> >> >> On Fri, Oct 16, 2015 at 12:07 PM, Jason White >> wrote: >> >>> Hi Ken, thanks for replying. >>> >>> Unless I'm misunderstanding something, I don't believe that's correct. >>> Dstream.transform() accepts a single argument, func. func should be a >>> function that accepts a single RDD, and returns a single RDD. That's what >>> transform_to_df does, except the RDD it returns is a DF. >>> >>> I've used Dstream.transform() successfully in the past when transforming >>> RDDs, so I don't think my problem is there. >>> >>> I haven't tried this in Scala yet, and all of the examples I've seen on >>> the >>> website seem to use foreach instead of transform. Does this approach >>> work in >>> Scala? >>> >>> >>> >>> >>> -- >>> View this message in context: >>> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-Streaming-DataFrames-tp25095p25099.html >>> Sent from the Apache Spark User List mailing list archive at Nabble.com. >>> >>> - >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>> For additional commands, e-mail: user-h...@spark.apache.org >>> >>> >> >
Re: PySpark + Streaming + DataFrames
Yes, precisely! Also, for other folks who may read this, could reply back with the trusted conversion that worked for you (for a clear solution)? TD On Mon, Oct 19, 2015 at 3:08 PM, Jason White wrote: > Ah, that makes sense then, thanks TD. > > The conversion from RDD -> DF involves a `.take(10)` in PySpark, even if > you provide the schema, so I was avoiding back-and-forth conversions. I’ll > see if I can create a ‘trusted’ conversion that doesn’t involve the `take`. > > -- > Jason > > On October 19, 2015 at 5:23:59 PM, Tathagata Das (t...@databricks.com) > wrote: > > RDD and DF are not compatible data types. So you cannot return a DF when > you have to return an RDD. What rather you can do is return the underlying > RDD of the dataframe by dataframe.rdd(). > > > On Fri, Oct 16, 2015 at 12:07 PM, Jason White > wrote: > >> Hi Ken, thanks for replying. >> >> Unless I'm misunderstanding something, I don't believe that's correct. >> Dstream.transform() accepts a single argument, func. func should be a >> function that accepts a single RDD, and returns a single RDD. That's what >> transform_to_df does, except the RDD it returns is a DF. >> >> I've used Dstream.transform() successfully in the past when transforming >> RDDs, so I don't think my problem is there. >> >> I haven't tried this in Scala yet, and all of the examples I've seen on >> the >> website seem to use foreach instead of transform. Does this approach work >> in >> Scala? >> >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-Streaming-DataFrames-tp25095p25099.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> >
Re: PySpark + Streaming + DataFrames
Ah, that makes sense then, thanks TD. The conversion from RDD -> DF involves a `.take(10)` in PySpark, even if you provide the schema, so I was avoiding back-and-forth conversions. I’ll see if I can create a ‘trusted’ conversion that doesn’t involve the `take`. -- Jason On October 19, 2015 at 5:23:59 PM, Tathagata Das (t...@databricks.com) wrote: RDD and DF are not compatible data types. So you cannot return a DF when you have to return an RDD. What rather you can do is return the underlying RDD of the dataframe by dataframe.rdd(). On Fri, Oct 16, 2015 at 12:07 PM, Jason White wrote: Hi Ken, thanks for replying. Unless I'm misunderstanding something, I don't believe that's correct. Dstream.transform() accepts a single argument, func. func should be a function that accepts a single RDD, and returns a single RDD. That's what transform_to_df does, except the RDD it returns is a DF. I've used Dstream.transform() successfully in the past when transforming RDDs, so I don't think my problem is there. I haven't tried this in Scala yet, and all of the examples I've seen on the website seem to use foreach instead of transform. Does this approach work in Scala? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-Streaming-DataFrames-tp25095p25099.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: PySpark + Streaming + DataFrames
RDD and DF are not compatible data types. So you cannot return a DF when you have to return an RDD. What rather you can do is return the underlying RDD of the dataframe by dataframe.rdd(). On Fri, Oct 16, 2015 at 12:07 PM, Jason White wrote: > Hi Ken, thanks for replying. > > Unless I'm misunderstanding something, I don't believe that's correct. > Dstream.transform() accepts a single argument, func. func should be a > function that accepts a single RDD, and returns a single RDD. That's what > transform_to_df does, except the RDD it returns is a DF. > > I've used Dstream.transform() successfully in the past when transforming > RDDs, so I don't think my problem is there. > > I haven't tried this in Scala yet, and all of the examples I've seen on the > website seem to use foreach instead of transform. Does this approach work > in > Scala? > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-Streaming-DataFrames-tp25095p25099.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
Re: PySpark + Streaming + DataFrames
Hi Ken, thanks for replying. Unless I'm misunderstanding something, I don't believe that's correct. Dstream.transform() accepts a single argument, func. func should be a function that accepts a single RDD, and returns a single RDD. That's what transform_to_df does, except the RDD it returns is a DF. I've used Dstream.transform() successfully in the past when transforming RDDs, so I don't think my problem is there. I haven't tried this in Scala yet, and all of the examples I've seen on the website seem to use foreach instead of transform. Does this approach work in Scala? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-Streaming-DataFrames-tp25095p25099.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org