Looks like this is a valid use case. I'm wondering if anyone can give some
high-level guidelines on how to implement it? I can give it a try.
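As a concrete starting point, here is a minimal sketch of the
write-then-read workaround Robert suggests further down the thread. It
assumes the RDD holds strings and that both jobs can reach the same
storage; the /shared/rdd-export path and the mySparkLibrary producer are
hypothetical stand-ins:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.spark.api.java.JavaRDD;

    // Spark side: materialize the RDD as text files on shared storage.
    JavaRDD<String> rdd = mySparkLibrary.produce();  // hypothetical producer
    rdd.saveAsTextFile("/shared/rdd-export");

    // Beam side: start the pipeline by reading those files back in.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    PCollection<String> input =
        p.apply(TextIO.read().from("/shared/rdd-export/part-*"));
    // ...existing Beam post-processing on `input` goes here...
    p.run().waitUntilFinish();

For non-string records, the same shape should work with a binary or
columnar format instead of text (e.g. Avro via AvroIO), at the cost of
agreeing on a schema between the two jobs.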
-Yushu

On Tue, May 24, 2022 at 2:42 AM Jan Lukavský <[email protected]> wrote:

> +dev@beam <[email protected]>
>
> On 5/24/22 11:40, Jan Lukavský wrote:
>
> Hi,
>
> I think this feature is valid. Every runner for which Beam is not a
> 'native' SDK uses some form of translation context, which maps a
> PCollection to the internal representation of the particular runner's
> SDK (an RDD in this case). It should be possible to "import" an RDD
> into the specific runner via something like
>
>   SparkRunner runner = ...;
>   PCollection<...> pCollection = runner.importRDD(rdd);
>
> and similarly
>
>   RDD<...> rdd = runner.exportRDD(pCollection);
>
> Yes, this would obviously be runner-specific, but that is actually the
> point. It would enable using features and libraries that Beam does not
> have, or micro-optimizing some particular step using runner-specific
> features that we don't have in Beam. We actually had this feature (at
> least as a prototype) many years ago, when Euphoria was a separate
> project.
>
> Jan
>
> On 5/23/22 20:58, Alexey Romanenko wrote:
>
> On 23 May 2022, at 20:40, Brian Hulette <[email protected]> wrote:
>
> Yeah, I'm not sure of any simple way to do this. I wonder if it's
> worth considering building some Spark runner-specific feature around
> this, or at least packaging up Robert's proposed solution?
>
> I'm not sure that a runner-specific feature is a good way to do this,
> since the other runners won't be able to support it, or am I missing
> something?
>
> There could be other interesting integrations in this space too, e.g.
> using Spark RDDs as a cache for Interactive Beam.
>
> Another option could be to add something like SparkIO (or
> FlinkIO/whatever) to read/write data from/to Spark data structures for
> such cases (Spark schema to Beam schema conversion could also be
> supported). And, dreaming a bit more, for those who need a mixed
> pipeline (e.g. Spark + Beam), such connectors could support the
> push-down of pure Spark pipelines, whose results would then be used
> downstream in Beam.
>
> —
> Alexey
>
> Brian
>
> On Mon, May 23, 2022 at 11:35 AM Robert Bradshaw <[email protected]>
> wrote:
>
>> The easiest way to do this would be to write the RDD somewhere, then
>> read it from Beam.
>>
>> On Mon, May 23, 2022 at 9:39 AM Yushu Yao <[email protected]> wrote:
>> >
>> > Hi Folks,
>> >
>> > I know this is not the optimal way to use Beam :-) But assume I only
>> > use the Spark runner.
>> >
>> > I have a Spark library (very complex) that emits a Spark dataframe
>> > (or RDD). I also have an existing, complex Beam pipeline that can do
>> > post-processing on the data inside the dataframe.
>> >
>> > However, the Beam part needs a PCollection to start with. The
>> > question is: how can I convert a Spark RDD into a PCollection?
>> >
>> > Thanks
>> > -Yushu
