Re: RDD (Spark dataframe) into a PCollection?

Jan Lukavský Tue, 24 May 2022 08:59:41 -0700

Yes, I suppose it might be more complex than the code snippet, that wasjust to demonstrate the idea. Also the "exportRDD" would probably returnWindowedValue<T> instead of plain T.


On 5/24/22 17:23, Reuven Lax wrote:

Something like this seems reasonable. Beam PCollections also have atimestamp associated with every element, so the importRDD functionprobably needs a way to specify the timestamp (could be an attributename for dataframes or a timestamp extraction function for regular RDDs).


On Tue, May 24, 2022 at 2:40 AM Jan Lukavský <[email protected]> wrote:

    Hi,
    I think this feature is valid. Every runner for which Beam is not
    a 'native' SDK uses some form of translation context, which maps
    PCollection to internal representation of the particular SDK of
    the runner (RDD in this case). It should be possible to "import"
    an RDD into the specific runner via something like

      SparkRunner runner = ....;
      PCollection<...> pCollection = runner.importRDD(rdd);

    and similarly

      RDD<...> rdd = runner.exportRDD(pCollection);

    Yes, apparently this would be runner specific, but that is the
    point, actually. This would enable using features and libraries,
    that Beam does not have, or micro-optimize some particular step
    using runner-specific features, that we don't have in Beam. We
    actually had this feature (at least in a prototype) many years ago
    when Euphoria was a separate project.

     Jan

    On 5/23/22 20:58, Alexey Romanenko wrote:

    On 23 May 2022, at 20:40, Brian Hulette <[email protected]> wrote:

    Yeah I'm not sure of any simple way to do this. I wonder if it's
    worth considering building some Spark runner-specific feature
    around this, or at least packaging up Robert's proposed solution?


    I’m not sure that a runner specific feature is a good way to do
    this since the other runners won’t be able to support it or I’m
    missing something?

    There could be other interesting integrations in this space too,
    e.g. using Spark RDDs as a cache for Interactive Beam.


    Another option could be to add something like SparkIO (or
    FlinkIO/whatever) to read/write data from/to Spark data
    structures for such cases (Spark schema to Beam schema convention
    also could be supported). And dreaming a bit more, for those who
    need to have a mixed pipeline (e.g. Spark + Beam) such connectors
    could support the push-downs of pure Spark pipelines and then use
    the result downstream in Beam.

    —
    Alexey


    Brian

    On Mon, May 23, 2022 at 11:35 AM Robert Bradshaw
    <[email protected]> wrote:

        The easiest way to do this would be to write the RDD
        somewhere then
        read it from Beam.

        On Mon, May 23, 2022 at 9:39 AM Yushu Yao
        <[email protected]> wrote:
        >
        > Hi Folks,
        >
        > I know this is not the optimal way to use beam :-) But
        assume I only use the spark runner.
        >
        > I have a spark library (very complex) that emits a spark
        dataframe (or RDD).
        > I also have an existing complex beam pipeline that can do
        post processing on the data inside the dataframe.
        >
        > However, the beam part needs a pcollection to start with.
        The question is, how can I convert a spark RDD into a
        pcollection?
        >
        > Thanks
        > -Yushu
        >

Re: RDD (Spark dataframe) into a PCollection?

Reply via email to