Agree with Amit. Not sure it would be portable to provide such a function.

Regards
JB

On 02/20/2017 11:04 AM, Amit Sela wrote:
Spark runner's EvaluationContext
<https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/EvaluationContext.java#L201>
has a hook ready for this - but clearly only for batch, in streaming
this feature doesn't seem relevant.

You can easily hack this in the Spark runner, but for Beam in general I
wonder how this would work in a runner-agnostic way. Spark has a driver
process; I'm not sure how this would work for other runners.
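
One runner-agnostic workaround is to go through files: write the
PCollection out, wait for the pipeline to finish, then read the shards
back in the driver. A rough sketch, not a built-in API (my assumptions:
the current TextIO API and an output path the driver can also read):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class CollectViaFiles {
  public static void main(String[] args) throws IOException {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Any PCollection<String> would do; Create is just a stand-in source.
    p.apply(Create.of(Arrays.asList("a", "b", "c")))
     // Write shards to a path both the workers and the driver can reach.
     .apply(TextIO.write().to("/tmp/collect/result"));

    // Block until all shards have been written.
    p.run().waitUntilFinish();

    // Read the shards back into a single local List<String> in the driver.
    List<String> collected;
    try (Stream<Path> shards = Files.list(Paths.get("/tmp/collect"))) {
      collected = shards.flatMap(CollectViaFiles::linesOf).collect(Collectors.toList());
    }
    System.out.println(collected);
  }

  private static Stream<String> linesOf(Path shard) {
    try {
      return Files.lines(shard);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}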

On Mon, Feb 20, 2017 at 11:54 AM Antony Mayi <[email protected]> wrote:

    Thanks Jean,

    my point is to retrieve the data represented, let's say, as a
    PCollection<String> into a List<String> (not a
    PCollection<List<String>>) - essentially fetching it all into the
    address space of the local driver process (this is what Spark's
    .collect() does). It would be the reverse of
    org.apache.beam.sdk.transforms.Create - which takes a local iterable
    and distributes it into a PCollection - so at the end of the pipeline
    I would like to get the result back as a single iterable (hence I
    assume it would need some kind of Sink).
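
    For illustration, the forward direction exists today; the collect()
    call below is hypothetical, just to show the inverse I am after:

    // Forward: local iterable -> PCollection (this is what Create does).
    List<String> local = Arrays.asList("a", "b", "c");
    PCollection<String> distributed = pipeline.apply(Create.of(local));

    // Inverse (hypothetical - no such method exists in Beam today):
    // List<String> backAgain = distributed.collect();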

    Thanks,
    Antony.


    On Monday, 20 February 2017, 10:40, Jean-Baptiste Onofré
    <[email protected]> wrote:


    Hi Antony,

    The Spark runner deals with caching/persisting for you (by analyzing how
    many times the same PCollection is used).

    For the collect(), I don't fully understand your question.

    If you want to process the elements in the PCollection, you can do a
    simple ParDo:

    .apply(ParDo.of(new DoFn<T, T>() {
      @ProcessElement
      public void processElement(ProcessContext context) {
        T element = context.element();
        // do whatever you want with the element
      }
    }));

    Is that not what you want?
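
    If the goal is specifically to get the elements into the driver's
    memory, a DoFn side effect only lands in the driver JVM when the
    runner executes in the same process. A rough sketch of that pattern
    (my assumption: a single-JVM runner such as the direct runner; on a
    distributed runner the static list would fill up on the workers
    instead):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class CollectToDriver {
      // Static, so every deserialized copy of the DoFn in this JVM sees it.
      static final Queue<String> RESULTS = new ConcurrentLinkedQueue<>();

      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply(Create.of(Arrays.asList("a", "b", "c")))
         .apply(ParDo.of(new DoFn<String, Void>() {
           @ProcessElement
           public void processElement(ProcessContext context) {
             RESULTS.add(context.element());
           }
         }));
        p.run().waitUntilFinish();

        // Only now is the PCollection fully materialized locally.
        List<String> collected = new ArrayList<>(RESULTS);
        System.out.println(collected);
      }
    }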

    Regards
    JB

    On 02/20/2017 10:30 AM, Antony Mayi wrote:
    > Hi,
    >
    > what is the best way to fetch the content of a PCollection into local
    > memory/process (something like calling .collect() on a Spark RDD)? Do I
    > need to implement a custom Sink?
    >
    > Thanks for any clues,
    > Antony.


    --
    Jean-Baptiste Onofré
    [email protected]
    http://blog.nanthrax.net
    Talend - http://www.talend.com




--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
