Hi Antony,

Generally, a PCollection is a distributed bag of elements, much like a Spark
RDD (for batch).
Assuming you have a large distributed collection, you probably wouldn't want
to materialize it locally, and even if it's a global count (a result small
enough that OOM in your "driver" isn't a concern) you'd probably still want
to write it to a Sink of some kind - Kafka, HDFS, etc.
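For example, a minimal sketch of writing results out with Beam's TextIO
(assuming a PCollection<String> named "results" and a placeholder output
path; API shape as in recent Beam releases):

import org.apache.beam.sdk.io.TextIO;

// write each element as a line of text; the path is just a placeholder
results.apply(TextIO.write().to("/tmp/output"));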

I'm curious how you would use "collect()" or otherwise materialize the
PCollection in the driver program. What did you have in mind?

You can implement a custom Sink. The Spark runner has its own ConsoleIO that
prints to the screen using Spark's print(), but I only use it for dev
iterations, and it obviously works only with the Spark runner.
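For a runner-agnostic way to peek at elements during dev, a quick sketch
(assumes a PCollection<String> named "words"; output goes to the workers'
stdout/logs, not necessarily your driver's console):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

words.apply(ParDo.of(new DoFn<String, Void>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // executes on the workers, so this is for debugging only
    System.out.println(c.element());
  }
}));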

Amit


On Mon, Feb 20, 2017 at 11:40 AM Jean-Baptiste Onofré <[email protected]>
wrote:

> Hi Antony,
>
> The Spark runner deals with caching/persist for you (analyzing how many
> times the same PCollection is used).
>
> For the collect(), I don't fully understand your question.
>
> If you want to process elements in the PCollection, you can use a simple
> ParDo:
>
> .apply(ParDo.of(new DoFn<T, Void>() {
>    @ProcessElement
>    public void processElement(ProcessContext context) {
>      T element = context.element();
>      // do whatever you want with the element
>    }
> }));
>
> Is that not what you want?
>
> Regards
> JB
>
> On 02/20/2017 10:30 AM, Antony Mayi wrote:
> > Hi,
> >
> > what is the best way to fetch the content of a PCollection into local
> > memory/process (something like calling .collect() on a Spark RDD)? Do I
> > need to implement a custom Sink?
> >
> > Thanks for any clues,
> > Antony.
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
