Question about side outputs and OutputTags in pyflink. The docs <https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/side_output/> say we are supposed to
yield output_tag, value Docs then say: > For retrieving the side output stream you use getSideOutput(OutputTag) on the result of the DataStream operation. >From this, I'd expect that calling datastream.get_side_output would be optional. However, it seems that if you do not call datastream.get_side_output, then the main datastream will have the record destined to the output tag still in it, as a Tuple(output_tag, value). This caused me great confusion for a while, as my downstream tasks would break because of the unexpected Tuple type of the record. Here's an example of the failure using side output and ProcessFunction in the word count example. <https://gist.github.com/ottomata/001df5df72eb1224c01c9827399fcbd7#file-pyflink_sideout_fail_word_count_example-py-L86-L100> I'd expect that just yielding an output_tag would make those records be in a different datastream, but apparently this is not the case unless you call get_side_output. If this is the expected behavior, perhaps the docs should be updated to say so? -Andrew Otto Wikimedia Foundation