> On 23 May 2022, at 20:40, Brian Hulette <bhule...@google.com> wrote:
>
> Yeah I'm not sure of any simple way to do this. I wonder if it's worth
> considering building some Spark runner-specific feature around this, or at
> least packaging up Robert's proposed solution?
I’m not sure that a runner-specific feature is a good way to do this, since the other runners won’t be able to support it, or am I missing something?

> There could be other interesting integrations in this space too, e.g. using
> Spark RDDs as a cache for Interactive Beam.

Another option could be to add something like SparkIO (or FlinkIO/whatever) to read/write data from/to Spark data structures for such cases (Spark schema to Beam schema conversion could also be supported).

And dreaming a bit more: for those who need to have a mixed pipeline (e.g. Spark + Beam), such connectors could support push-down of pure Spark pipelines and then use the result downstream in Beam.

—
Alexey

>
> Brian
>
> On Mon, May 23, 2022 at 11:35 AM Robert Bradshaw <rober...@google.com> wrote:
> The easiest way to do this would be to write the RDD somewhere then
> read it from Beam.
>
> On Mon, May 23, 2022 at 9:39 AM Yushu Yao <yao.yu...@gmail.com> wrote:
> >
> > Hi Folks,
> >
> > I know this is not the optimal way to use Beam :-) But assume I only use
> > the Spark runner.
> >
> > I have a Spark library (very complex) that emits a Spark DataFrame (or RDD).
> > I also have an existing complex Beam pipeline that can do post-processing
> > on the data inside the DataFrame.
> >
> > However, the Beam part needs a PCollection to start with. The question is,
> > how can I convert a Spark RDD into a PCollection?
> >
> > Thanks
> > -Yushu
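For reference, Robert's suggestion (write the RDD somewhere, then read it from Beam) might look roughly like this. On the Spark side you would write something like `rdd.map(json.dumps).saveAsTextFile(path)`, and on the Beam side read it back with `beam.io.ReadFromText(path + "/part-*") | beam.Map(json.loads)`. The sketch below only exercises the interchange format (one JSON document per line) using the standard library, so it runs without Spark or Beam installed; the file name and record schema are made up for illustration.

```python
import json
import os
import tempfile


def write_records(records, path):
    """Stand-in for rdd.map(json.dumps).saveAsTextFile: one JSON doc per line."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_records(path):
    """Stand-in for beam.io.ReadFromText(...) | beam.Map(json.loads)."""
    with open(path) as f:
        return [json.loads(line) for line in f]


# Hypothetical handoff directory and records, for illustration only.
part_file = os.path.join(tempfile.mkdtemp(), "part-00000")
data = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
write_records(data, part_file)
assert read_records(data and part_file) == data
```

The key point is that both sides only have to agree on the storage location and the line format; the Spark job and the Beam pipeline stay completely decoupled and can even run at different times.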