> On 23 May 2022, at 20:40, Brian Hulette <bhule...@google.com> wrote:
> 
> Yeah I'm not sure of any simple way to do this. I wonder if it's worth 
> considering building some Spark runner-specific feature around this, or at 
> least packaging up Robert's proposed solution? 

I’m not sure a runner-specific feature is a good way to do this, since the
other runners won’t be able to support it. Or am I missing something?
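
Packaging up Robert’s suggestion (write the RDD out, then read it back in
Beam) sounds more portable to me. For reference, that hand-off could look
roughly like this (a minimal sketch, assuming the Java SDKs on both sides,
text files as the interchange format and an example HDFS path; in practice
Avro or Parquet would preserve types better, and the Spark job and the Beam
pipeline would normally be separate programs rather than one main()):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import java.util.Arrays;

    public class RddHandoff {
      public static void main(String[] args) {
        // Spark side: materialize the RDD onto storage both systems can reach.
        JavaSparkContext sc = new JavaSparkContext();
        JavaRDD<String> rdd = sc.parallelize(Arrays.asList("a", "b", "c"));
        rdd.saveAsTextFile("hdfs:///tmp/rdd-handoff");  // example path

        // Beam side: start the pipeline from the files Spark just wrote.
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        PCollection<String> lines =
            p.apply(TextIO.read().from("hdfs:///tmp/rdd-handoff/part-*"));
        // ...the existing Beam post-processing would consume `lines` here...
        p.run().waitUntilFinish();
      }
    }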

> There could be other interesting integrations in this space too, e.g. using 
> Spark RDDs as a cache for Interactive Beam.

Another option could be to add something like SparkIO (or FlinkIO/whatever) to
read/write data from/to Spark data structures for such cases (Spark schema to
Beam schema conversion could also be supported). And, dreaming a bit more, for
those who need a mixed pipeline (e.g. Spark + Beam), such connectors could
support pushing down the pure Spark parts of a pipeline and then consuming the
result downstream in Beam.
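
Purely as an illustration (nothing like this exists today, and every name in
the sketch below, SparkIO included, is invented), using such a connector might
look roughly like:

    // Hypothetical sketch only: SparkIO, withDataset() and withBeamSchema()
    // do not exist; they are invented to illustrate the idea above.
    PCollection<Row> rows =
        pipeline.apply(
            SparkIO.<Row>read()
                .withDataset(sparkDataset)   // hand over a Spark Dataset
                .withBeamSchema());          // convert Spark schema to Beam schema
    // ...existing Beam transforms would continue from `rows` downstream...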

—
Alexey


> 
> Brian
> 
> On Mon, May 23, 2022 at 11:35 AM Robert Bradshaw <rober...@google.com> wrote:
> The easiest way to do this would be to write the RDD somewhere then
> read it from Beam.
> 
> On Mon, May 23, 2022 at 9:39 AM Yushu Yao <yao.yu...@gmail.com> wrote:
> >
> > Hi Folks,
> >
> > I know this is not the optimal way to use Beam :-) But assume I only use 
> > the Spark runner.
> >
> > I have a Spark library (very complex) that emits a Spark DataFrame (or RDD).
> > I also have an existing complex Beam pipeline that can do post-processing 
> > on the data inside the DataFrame.
> >
> > However, the Beam part needs a PCollection to start with. The question is, 
> > how can I convert a Spark RDD into a PCollection?
> >
> > Thanks
> > -Yushu
> >
