Re: Pyflink Side Output Question and/or suggested documentation change

2023-02-13 Thread Andrew Otto
Thank you!

On Mon, Feb 13, 2023 at 5:55 AM Dian Fu  wrote:

> Thanks Andrew, I think this is a valid advice. I will update the
> documentation~
>
> Regards,
> Dian
>
> ,
>
> On Fri, Feb 10, 2023 at 10:08 PM Andrew Otto  wrote:
>
>> Question about side outputs and OutputTags in pyflink.  The docs
>> 
>> say we are supposed to
>>
>> yield output_tag, value
>>
>> Docs then say:
>> > For retrieving the side output stream you use getSideOutput(OutputTag) on
>> the result of the DataStream operation.
>>
>> From this, I'd expect that calling datastream.get_side_output would be
>> optional.   However, it seems that if you do not call
>> datastream.get_side_output, then the main datastream will have the record
>> destined to the output tag still in it, as a Tuple(output_tag, value).
>> This caused me great confusion for a while, as my downstream tasks would
>> break because of the unexpected Tuple type of the record.
>>
>> Here's an example of the failure using side output and ProcessFunction
>> in the word count example.
>> 
>>
>> I'd expect that just yielding an output_tag would make those records be
>> in a different datastream, but apparently this is not the case unless you
>> call get_side_output.
>>
>> If this is the expected behavior, perhaps the docs should be updated to
>> say so?
>>
>> -Andrew Otto
>>  Wikimedia Foundation
>>
>>
>>
>>
>>


Re: Pyflink Side Output Question and/or suggested documentation change

2023-02-13 Thread Dian Fu
Thanks Andrew, I think this is a valid advice. I will update the
documentation~

Regards,
Dian

,

On Fri, Feb 10, 2023 at 10:08 PM Andrew Otto  wrote:

> Question about side outputs and OutputTags in pyflink.  The docs
> 
> say we are supposed to
>
> yield output_tag, value
>
> Docs then say:
> > For retrieving the side output stream you use getSideOutput(OutputTag) on
> the result of the DataStream operation.
>
> From this, I'd expect that calling datastream.get_side_output would be
> optional.   However, it seems that if you do not call
> datastream.get_side_output, then the main datastream will have the record
> destined to the output tag still in it, as a Tuple(output_tag, value).
> This caused me great confusion for a while, as my downstream tasks would
> break because of the unexpected Tuple type of the record.
>
> Here's an example of the failure using side output and ProcessFunction in
> the word count example.
> 
>
> I'd expect that just yielding an output_tag would make those records be in
> a different datastream, but apparently this is not the case unless you call
> get_side_output.
>
> If this is the expected behavior, perhaps the docs should be updated to
> say so?
>
> -Andrew Otto
>  Wikimedia Foundation
>
>
>
>
>


Pyflink Side Output Question and/or suggested documentation change

2023-02-10 Thread Andrew Otto
Question about side outputs and OutputTags in pyflink.  The docs

say we are supposed to

yield output_tag, value

Docs then say:
> For retrieving the side output stream you use getSideOutput(OutputTag) on
the result of the DataStream operation.

>From this, I'd expect that calling datastream.get_side_output would be
optional.   However, it seems that if you do not call
datastream.get_side_output, then the main datastream will have the record
destined to the output tag still in it, as a Tuple(output_tag, value).
This caused me great confusion for a while, as my downstream tasks would
break because of the unexpected Tuple type of the record.

Here's an example of the failure using side output and ProcessFunction in
the word count example.


I'd expect that just yielding an output_tag would make those records be in
a different datastream, but apparently this is not the case unless you call
get_side_output.

If this is the expected behavior, perhaps the docs should be updated to say
so?

-Andrew Otto
 Wikimedia Foundation