Thanks Jean,
my point is to retrieve the data represented, say, as a PCollection<String> into 
a List<String> (not a PCollection<List<String>>) - essentially fetching it all 
into the address space of the local driver process (this is what Spark's 
.collect() does). It would be the reverse of 
org.apache.beam.sdk.transforms.Create - which takes a local iterable and 
distributes it into a PCollection - at the end of the pipeline I would like to 
get the result back into a single iterable (hence I assume it would need some 
kind of Sink).
Thanks, Antony.
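(One workaround sketch, not an official Beam "collect" API: write the PCollection out with TextIO, then read the shards back in the driver once the pipeline finishes. This assumes a Beam 2.x SDK - older SDKs used TextIO.Write.to(...) - and the output path here is hypothetical.)

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CollectViaFiles {
  public static void main(String[] args) throws IOException {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    PCollection<String> lines = p.apply(Create.of("a", "b", "c"));

    // Write the PCollection to text shards; "/tmp/collect/out" is a
    // hypothetical path - adjust for your filesystem/runner.
    lines.apply(TextIO.write().to("/tmp/collect/out").withSuffix(".txt"));

    // Block until the pipeline is done so all shards exist.
    p.run().waitUntilFinish();

    // Back in the driver process: read every shard into a single List<String>.
    List<String> collected;
    try (Stream<Path> shards = Files.list(Paths.get("/tmp/collect"))) {
      collected = shards
          .filter(f -> f.getFileName().toString().endsWith(".txt"))
          .sorted()
          .flatMap(f -> {
            try {
              return Files.lines(f);
            } catch (IOException e) {
              throw new RuntimeException(e);
            }
          })
          .collect(Collectors.toList());
    }
    System.out.println(collected);
  }
}
```

The round trip through files is what makes this runner-independent: any runner can write shards to a shared filesystem, and only the driver-side read needs to run locally.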

    On Monday, 20 February 2017, 10:40, Jean-Baptiste Onofré 
<[email protected]> wrote:

Hi Antony,

The Spark runner deals with caching/persist for you (analyzing how many 
times the same PCollection is used).

For the collect(), I don't fully understand your question.

If you want to process the elements of the PCollection, you can do a simple 
ParDo:

.apply(ParDo.of(new DoFn<String, Void>() {
  @ProcessElement
  public void processElement(ProcessContext context) {
    String element = context.element();
    // do whatever you want with the element
  }
}));
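(Spelling that out as a self-contained sketch: the DoFn can push elements into a static collection that the driver then reads. Note this is an assumption-laden pattern, not a Beam API - it only works when the runner executes in the same JVM as the driver, e.g. the DirectRunner; on a distributed runner each worker would fill its own copy of the list.)

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class CollectToDriver {
  // Static accumulator shared between the DoFn instances and the driver.
  // Valid only on a single-JVM runner such as the DirectRunner.
  private static final Queue<String> RESULTS = new ConcurrentLinkedQueue<>();

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    p.apply(Create.of("a", "b", "c"))
     .apply(ParDo.of(new DoFn<String, Void>() {
       @ProcessElement
       public void processElement(ProcessContext context) {
         // Each element lands in the driver JVM's queue.
         RESULTS.add(context.element());
       }
     }));

    p.run().waitUntilFinish();

    // All elements are now in local driver memory.
    System.out.println(RESULTS);
  }
}
```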

Is that not what you want?

Regards
JB

On 02/20/2017 10:30 AM, Antony Mayi wrote:
> Hi,
>
> what is the best way to fetch content of PCollection to local
> memory/process (something like calling .collect() on Spark rdd)? Do I
> need to implement custom Sink?
>
> Thanks for any clues,
> Antony.

-- 
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
