Thanks Jean,
my point is to retrieve the data represented, say, as a PCollection<String> into
a List<String> (not a PCollection<List<String>>) - essentially fetching it all into
the address space of the local driver process (this is what Spark's .collect()
does). It would be the reverse of beam.sdks.transforms.Create - which takes a
local iterable and distributes it as a PCollection. At the end of the pipeline
I would like to get the result back as a single iterable (hence I assume it
would need some kind of Sink).
Thanks,
Antony.
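[Editor's note: Beam has no built-in equivalent of Spark's .collect(). On a single-JVM runner such as the DirectRunner, one common workaround is a side-effecting DoFn that appends each element to a static thread-safe collection, which the driver reads after pipeline.run().waitUntilFinish(). Below is a minimal plain-Java sketch of that accumulation pattern, with no Beam dependency: the parallel stream stands in for the runner's worker threads, and the class/method names (LocalCollect, collectLocally) are hypothetical.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

public class LocalCollect {
    // Thread-safe buffer that parallel workers append to; this plays the
    // role of the static field a side-effecting DoFn would write into
    // when running on a single-JVM runner.
    static final ConcurrentLinkedQueue<String> BUFFER = new ConcurrentLinkedQueue<>();

    // Gathers all elements "processed in parallel" back into one local List.
    static List<String> collectLocally(List<String> input) {
        BUFFER.clear();
        // The parallel stream stands in for the runner executing the DoFn
        // across multiple worker threads within the same JVM.
        input.parallelStream().forEach(BUFFER::add);
        // Drain the buffer into an ordinary List in the driver's memory.
        return new ArrayList<>(BUFFER);
    }

    public static void main(String[] args) {
        List<String> out = collectLocally(List.of("a", "b", "c"));
        // Note: element order is not guaranteed, just like a distributed collect.
        System.out.println(out.size());
    }
}
```

This only works when all workers share the driver's JVM; on a distributed runner the usual answer is to write to a sink (e.g. a file) and read the results back.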
On Monday, 20 February 2017, 10:40, Jean-Baptiste Onofré
<[email protected]> wrote:
Hi Antony,
The Spark runner deals with caching/persist for you (analyzing how many
times the same PCollection is used).
For the collect(), I don't fully understand your question.
If you want to process the elements in the PCollection, you can use a
simple ParDo:
.apply(ParDo.of(new DoFn<T, Void>() {
  @ProcessElement
  public void processElement(ProcessContext context) {
    T element = context.element();
    // do whatever you want with the element
  }
}));
Is that not what you want?
Regards
JB
On 02/20/2017 10:30 AM, Antony Mayi wrote:
> Hi,
>
> what is the best way to fetch content of PCollection to local
> memory/process (something like calling .collect() on Spark rdd)? Do I
> need to implement custom Sink?
>
> Thanks for any clues,
> Antony.
--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com