GroupByKey is generating fake duplicates on Dataflow

2020-12-18 Thread bits horoscope
Hi Apache Beam community, I have been dealing with a bug in a GroupByKey
step.

I'm reading an XML file that contains many listings, something like this:



ABC-37717
First Listing


ABC-37718
Second listing


ABC-37719
Third listing



I want to work only with the listings that have a unique code and discard
the duplicates. I have checked the input and all the codes are different;
however, my Dataflow pipeline is treating many listings as duplicates
(which is wrong, because they are all distinct). I have read about
sharding and how Dataflow distributes work in the cloud, so maybe the
windowing is processing the same element more than once and then marking
it as a duplicate. But I don't know how to correct it. What would you
recommend?

This is the code of the pipeline

final TupleTag tagCodeUnique = new TupleTag() {};

final TupleTag tagCodeDup = new TupleTag() {};

PCollectionTuple tupleCode = tuplePhones.get(tagOutListings)
    .apply(new RemoveDuplicates(
        "RemoveDuplicatesCode",
        tagCodeUnique,
        tagCodeDup,
        new KeyMapperCode(),
        opts.getDuplicateCodeComparator()));


The transform:


@Override
public PCollectionTuple expand(PCollection input) {
  return input
      .apply(WithKeys.of(this.mapper))
      .apply(GroupByKey.create())
      .apply("Pick" + this.getName(), ParDo
          .of(new FnPickDuplicate(this.tagDuplicates, this.duplicateComparator, this.getName()))
          .withOutputTags(this.tagUnique, TupleTagList.of(this.tagDuplicates)));
}
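
For illustration only, here is a plain-Java sketch (no Beam) of the
group-then-pick logic that the transform above appears to implement. The
class DuplicateCheck, the split method and the prefix-based key mapper are
hypothetical; the real KeyMapperCode and FnPickDuplicate are not shown in
this message, so this is an assumption about their behavior. The point it
makes is that distinct listings can only land in the duplicates output if
the key mapper assigns them the same key, so the keys produced by the
mapper may be worth logging.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class DuplicateCheck {

    /** Outcome of the split: one item kept per key, the others flagged. */
    public static final class Result<T> {
        public final List<T> unique = new ArrayList<>();
        public final List<T> duplicates = new ArrayList<>();
    }

    /**
     * Groups items by key (roughly what WithKeys + GroupByKey do), keeps the
     * first item of each group and flags the remaining items as duplicates.
     */
    public static <T> Result<T> split(List<T> items, Function<T, String> keyMapper) {
        Map<String, List<T>> groups = new LinkedHashMap<>();
        for (T item : items) {
            groups.computeIfAbsent(keyMapper.apply(item), k -> new ArrayList<>()).add(item);
        }
        Result<T> result = new Result<>();
        for (List<T> group : groups.values()) {
            result.unique.add(group.get(0));
            result.duplicates.addAll(group.subList(1, group.size()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> codes = Arrays.asList("ABC-37717", "ABC-37718", "ABC-37719");

        // Keyed by the full code: no two listings share a key.
        Result<String> byCode = split(codes, code -> code);
        System.out.println(byCode.duplicates.size()); // prints 0

        // Hypothetical faulty mapper keyed on the prefix only: all listings collide.
        Result<String> byPrefix = split(codes, code -> code.substring(0, 3));
        System.out.println(byPrefix.duplicates.size()); // prints 2
    }
}
```

A mapper that returns the same value (for example an empty or null code)
for several listings would produce the same effect as the prefix mapper
in this sketch.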

Andres Bravo
 @SirAndyBrave


Combine with multiple outputs case Sample and the rest

2020-12-18 Thread Ismaël Mejía
I had a question today from one of our users about Beam’s Sample
transform (a Combine with an internal top-like function to produce a
uniform sample of size n of a PCollection). They also wanted to obtain
the rest of the PCollection as an output (the non-sampled elements).

My suggestion was to use the sample (since it was small) as a side
input and then reprocess the collection to filter out the sampled
elements, but I wonder whether this is the ‘best’ solution.
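
To make the two-pass idea concrete, here is a minimal plain-Java sketch
on an in-memory list (no Beam). The class SampleAndRest and its methods
are hypothetical names. Reservoir sampling stands in for Beam’s Sample
here; it is not the top-like function Beam uses internally. The sketch
also assumes the elements are distinct, because filtering by equality
would otherwise drop every copy of a sampled value.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class SampleAndRest {

    /** Pass 1: uniform sample of at most n elements (reservoir sampling). */
    public static <T> List<T> sample(List<T> input, int n, Random random) {
        List<T> reservoir = new ArrayList<>();
        if (n <= 0) {
            return reservoir;
        }
        for (int i = 0; i < input.size(); i++) {
            if (i < n) {
                // Fill the reservoir with the first n elements.
                reservoir.add(input.get(i));
            } else {
                // Replace a random slot with probability n / (i + 1).
                int j = random.nextInt(i + 1);
                if (j < n) {
                    reservoir.set(j, input.get(i));
                }
            }
        }
        return reservoir;
    }

    /** Pass 2: re-read the input with the sample playing the role of the side input. */
    public static <T> List<T> rest(List<T> input, List<T> sample) {
        Set<T> sampled = new HashSet<>(sample);
        List<T> rest = new ArrayList<>();
        for (T element : input) {
            if (!sampled.contains(element)) {
                rest.add(element);
            }
        }
        return rest;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            input.add(i);
        }
        List<Integer> sample = sample(input, 3, new Random());
        List<Integer> rest = rest(input, sample);
        System.out.println(sample.size() + " sampled, " + rest.size() + " remaining");
    }
}
```

The second pass only stays cheap while the sample is small enough to
hold in memory on every worker, which matches the assumption above.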

I was also wondering: if Combine is essentially GbK + ParDo, why don’t
we have a Combine function with multiple outputs (maybe an evolution of
CombineWithContext)? I know this sounds weird, and I have probably not
thought enough about the issues or the performance of the translation,
but I wanted to see what others think. Does this make sense? Do you see
any pros/cons, or have other ideas?

Thanks,
Ismaël