Re: RDD (Spark dataframe) into a PCollection?

2022-05-24 Thread Moritz Mack
Hi Yushu, Have a look at org.apache.beam.runners.spark.translation.EvaluationContext in the Spark runner. It maintains that mapping between PCollections and RDDs (wrapped in the BoundedDataset helper). As Reuven just pointed out, values are timestamped (and windowed) in Beam, therefore
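The mapping Moritz describes can be sketched as a toy model. All names below are illustrative stand-ins, not real Beam APIs: the actual EvaluationContext keys by PCollection and wraps RDDs in the BoundedDataset helper, whereas this sketch uses plain strings and lists.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a runner translation context: it maps a logical
// collection handle to the materialized dataset backing it, the way
// the Spark runner's EvaluationContext maps PCollections to RDDs.
public class ToyTranslationContext {
    private final Map<String, List<?>> datasets = new HashMap<>();

    // Register the backing dataset produced when translating a transform.
    public void putDataset(String collectionId, List<?> dataset) {
        datasets.put(collectionId, dataset);
    }

    // Look up the backing dataset when translating a downstream transform.
    public List<?> getDataset(String collectionId) {
        return datasets.get(collectionId);
    }

    public static void main(String[] args) {
        ToyTranslationContext ctx = new ToyTranslationContext();
        ctx.putDataset("words", List.of("a", "b", "c"));
        System.out.println(ctx.getDataset("words").size()); // prints 3
    }
}
```

"Importing" an RDD into Beam would then amount to registering an externally produced dataset under a fresh handle before pipeline translation runs, so downstream transforms resolve to it.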

Re: RDD (Spark dataframe) into a PCollection?

2022-05-24 Thread Jan Lukavský
Yes, I suppose it might be more complex than the code snippet, that was just to demonstrate the idea. Also the "exportRDD" would probably return WindowedValue instead of plain T. On 5/24/22 17:23, Reuven Lax wrote: Something like this seems reasonable. Beam PCollections also have a timestamp
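The point about returning WindowedValue instead of plain T can be illustrated with a toy sketch. Note that "exportRDD" is a hypothetical name from this thread, and the TimestampedValue record below is a simplified stand-in for Beam's WindowedValue (which also carries window and pane info), not the real class:

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy sketch of the idea behind the hypothetical "exportRDD" bridge:
// every element entering Beam must carry the metadata Beam attaches to
// elements (timestamp, window). TimestampedValue is a simplified
// stand-in for Beam's WindowedValue.
public class ToyExport {
    // Stand-in for BoundedWindow.TIMESTAMP_MIN_VALUE, a common default
    // for elements imported from a source with no timestamps of its own.
    static final long MIN_TIMESTAMP = Long.MIN_VALUE;

    record TimestampedValue<T>(T value, long timestampMillis) {}

    // Wrap each raw element with a default timestamp, as an import
    // bridge would before handing elements to Beam transforms.
    static <T> List<TimestampedValue<T>> wrap(List<T> raw) {
        return raw.stream()
                .map(v -> new TimestampedValue<>(v, MIN_TIMESTAMP))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(wrap(List.of("x", "y")).size()); // prints 2
    }
}
```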

Re: RDD (Spark dataframe) into a PCollection?

2022-05-24 Thread Yushu Yao
Looks like it's a valid use case. Wondering if anyone can give some high-level guidelines on how to implement this? I can give it a try. -Yushu On Tue, May 24, 2022 at 2:42 AM Jan Lukavský wrote: > +dev@beam > On 5/24/22 11:40, Jan Lukavský wrote: > > Hi, > I think this feature is valid. Every

Re: RDD (Spark dataframe) into a PCollection?

2022-05-24 Thread Jan Lukavský
+dev@beam On 5/24/22 11:40, Jan Lukavský wrote: Hi, I think this feature is valid. Every runner for which Beam is not a 'native' SDK uses some form of translation context, which maps PCollection to internal representation of the particular SDK of the runner

Re: RDD (Spark dataframe) into a PCollection?

2022-05-24 Thread Jan Lukavský
Hi, I think this feature is valid. Every runner for which Beam is not a 'native' SDK uses some form of translation context, which maps PCollection to internal representation of the particular SDK of the runner (RDD in this case). It should be possible to "import" an RDD into the specific

Re: RDD (Spark dataframe) into a PCollection?

2022-05-23 Thread Alexey Romanenko
> On 23 May 2022, at 20:40, Brian Hulette wrote: > > Yeah I'm not sure of any simple way to do this. I wonder if it's worth > considering building some Spark runner-specific feature around this, or at > least packaging up Robert's proposed solution? I’m not sure that a runner specific

Re: RDD (Spark dataframe) into a PCollection?

2022-05-23 Thread Yushu Yao
Thanks Robert and Brian. As for "writing the RDD somewhere", I can totally write a bunch of files on disk/s3. Any other options? -Yushu On Mon, May 23, 2022 at 11:40 AM Brian Hulette wrote: > Yeah I'm not sure of any simple way to do this. I wonder if it's worth > considering building some

Re: RDD (Spark dataframe) into a PCollection?

2022-05-23 Thread Alexey Romanenko
To add a bit more to what Robert suggested. Right, in general we can’t read a Spark RDD directly with Beam (the Spark runner uses RDDs under the hood, but that’s a different story), but you can write the results to any storage, in a data format that Beam supports, and then read them with a corresponding Beam
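The storage handoff Alexey describes can be sketched end to end. In practice the "Spark side" would be something like df.write().parquet(...) to S3 and the "Beam side" a read with ParquetIO or TextIO; in this self-contained sketch plain file I/O stands in for both engines, and the point is that only the file format is the contract between them:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch of the storage handoff between Spark and Beam: the Spark side
// writes results in a format Beam can read (here one record per line,
// a stand-in for e.g. Parquet on S3), and the Beam side reads it back.
public class StorageHandoff {
    // Stand-in for df.write().text(...) / df.write().parquet(...).
    static void sparkSideWrite(Path out, List<String> records) throws IOException {
        Files.write(out, records);
    }

    // Stand-in for TextIO.read().from(...) / ParquetIO.read(...).
    static List<String> beamSideRead(Path in) throws IOException {
        return Files.readAllLines(in);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("handoff", ".txt");
        sparkSideWrite(tmp, List.of("row1", "row2"));
        System.out.println(beamSideRead(tmp)); // prints [row1, row2]
        Files.delete(tmp);
    }
}
```

The cost of this approach is an extra materialization to storage between the two pipelines, which is why the thread also explores importing the RDD directly into the runner's translation context.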

RDD (Spark dataframe) into a PCollection?

2022-05-23 Thread Yushu Yao
Hi Folks, I know this is not the optimal way to use Beam :-) But assume I only use the Spark runner. I have a Spark library (very complex) that emits a Spark dataframe (or RDD). I also have an existing complex Beam pipeline that can do post-processing on the data inside the dataframe. However,