Alex Amato created BEAM-7026:
--------------------------------

             Summary: Python SDK: Unable to obtain the PCollection for output 
tags which are not consumed by a downstream step.
                 Key: BEAM-7026
                 URL: https://issues.apache.org/jira/browse/BEAM-7026
             Project: Beam
          Issue Type: New Feature
          Components: sdk-py-harness
            Reporter: Alex Amato


I noticed that we are not able to convert the output tag + transform to the 
PCollection name for metrics (element count/mean byte count) if the 
PCollections for the output tags are not consumed by a downstream step.

This isn't critical, as (1) arguably there is no PCollection at all, and (2) 
PCollections that are output but not consumed are not critical to count 
metrics on, as those can be optimized away entirely (no need to do any work, 
collect metrics, etc. for an unconsumed PCollection).

However, we are able to count this; we are just unable to assign a PCollection 
name to it, as in this case there is no information about that output tag 
defined in the bundle descriptor. The alternative fix is to make sure that it 
is always available, even if not consumed.

Pablo and I looked into this a bit, and he believed it would be possible in 
pvalue.py's DoOutputsTuple class. This fix would require calling __getitem__ 
on all tags to initialize them properly. However, I had some trouble doing 
this, as this class is a bit strange since it overrides __getattr__. I found 
weird behaviors when adding functionality to this code. I don't really get how 
the code functions today, as its own instance variable usage should trigger 
the custom __getattr__ code, yet we seem to be using these attrs normally with 
self.X usages.
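
A possible explanation for the self.X behavior, offered as a sketch rather 
than a reading of the Beam source: Python calls __getattr__ only when normal 
attribute lookup fails, so attributes assigned in __init__ are found in the 
instance __dict__ and never reach the override (__getattribute__ is the hook 
that intercepts every access). The class below is a hypothetical stand-in, not 
DoOutputsTuple; the names LazyOutputs, _fallback_calls and materialize_all are 
made up for illustration, and the "pcoll:..." strings stand in for real 
PCollection objects.

```python
class LazyOutputs:
    """Toy tagged-output container (hypothetical, not Beam's DoOutputsTuple)."""

    def __init__(self, tags):
        # These land in self.__dict__, so later reads via self.X succeed
        # through normal lookup and never invoke __getattr__.
        self._tags = tuple(tags)
        self._pcolls = {}
        self._fallback_calls = []

    def __getattr__(self, name):
        # Reached only when normal lookup (instance dict, class, bases) fails.
        # Private names are rejected early so a half-constructed instance
        # cannot recurse through self._pcolls or self._tags.
        if name.startswith("_"):
            raise AttributeError(name)
        self._fallback_calls.append(name)
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __getitem__(self, tag):
        if tag not in self._tags:
            raise KeyError(tag)
        # Lazily create the output the first time the tag is requested.
        if tag not in self._pcolls:
            self._pcolls[tag] = "pcoll:%s" % tag
        return self._pcolls[tag]

    def materialize_all(self):
        # Eagerly touch every declared tag so each one has an output,
        # whether or not anything downstream asks for it.
        for tag in self._tags:
            self[tag]
        return dict(self._pcolls)


outs = LazyOutputs(["even", "odd"])
outs._tags              # found in __dict__; __getattr__ is not called
outs.even               # not in __dict__; falls back to __getattr__
outs.materialize_all()  # initializes "odd" as well
```

Under this reading, any new instance attribute has to be assigned in __init__ 
before it is first read; otherwise the read falls through to __getattr__ and 
is treated as a tag lookup, which may account for some of the odd behavior.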



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
