This is with respect to how work is assigned to a Task. For a shuffle edge, a Task's input is determined by the partitions and by how those partitions are assigned to tasks. For a vertex reading data from HDFS (an initial input), the assignment is effectively random: the input data is split up and the splits are then assigned to tasks.
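To illustrate why the shuffle edge side is deterministic: a hash partitioner routes every record with the same key to the same downstream partition, no matter which upstream task emitted it. The sketch below mirrors the usual Hadoop/Tez-style hash-partitioning logic; the class and method names are illustrative, not part of the Tez API.

```java
// Illustrative sketch (not Tez API): deterministic key-to-partition routing
// as done by a typical hash partitioner on a shuffle edge.
public class HashPartitionSketch {

    // Every upstream task computes the same partition for the same key,
    // so the data a downstream task receives is fully determined by its
    // partition index.
    static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so the result is non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numDownstreamTasks = 4;
        // The same key always lands on the same downstream task.
        int p1 = partitionFor("user-42", numDownstreamTasks);
        int p2 = partitionFor("user-42", numDownstreamTasks);
        System.out.println(p1 == p2); // deterministic routing
    }
}
```

By contrast, there is no such key-based rule for HDFS splits: the framework is free to hand any split to any task, which is why combining the two inputs needs explicit coordination.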
When trying to combine the data, the user would need to write a custom vertex manager to correctly assign data from the initial input and the shuffle edge in a deterministic manner (along with any other user-specific conditions, such as performing a "join") for the processing to be done correctly. I believe Hive has a couple of cases where this is implemented. You should ask on the dev@hive list for more details.

— Hitesh

On May 18, 2015, at 9:00 AM, Oleg Zhurakousky <[email protected]> wrote:

> Also, while trying something related to this I've noticed the following: "A
> vertex with an Initial Input and a Shuffle Input are not supported at the
> moment".
> Is there a target timeframe for this? JIRA?
>
> Thanks
> Oleg
>
>> On May 18, 2015, at 10:27 AM, Oleg Zhurakousky
>> <[email protected]> wrote:
>>
>> Is it possible to allow a Tez processor implementation which has multiple
>> inputs to become available as soon as at least one input is available to be
>> read?
>> This could allow some computation to begin while waiting for other
>> inputs. Other inputs could (if logic allows) be processed as they become
>> available.
>>
>>
>> Thanks
>> Oleg
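To make the custom vertex manager idea concrete, here is a hedged, self-contained sketch of the kind of deterministic placement decision such a manager would have to compute when a vertex has both an initial (HDFS) input and a shuffle input. All names here are hypothetical, not Tez API; a real implementation would extend org.apache.tez.dag.api.VertexManagerPlugin and assign splits to tasks when the root input is initialized. The assumed convention that a split's name ends with its bucket id (e.g. "part-3") is purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not Tez API): deterministically co-locate HDFS
// splits with shuffle partitions so that, e.g., a bucketed join sees
// matching data on both inputs of the same task.
public class DeterministicAssignmentSketch {

    // One task per shuffle partition; each HDFS split is routed to the
    // task whose partition index matches the split's bucket, so the
    // mapping is the same on every run of the same input.
    static List<List<String>> assignSplits(List<String> splitBuckets, int numPartitions) {
        List<List<String>> perTask = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            perTask.add(new ArrayList<>());
        }
        for (String split : splitBuckets) {
            // Assumed naming convention: split name ends with its bucket
            // id, e.g. "part-3" -> bucket 3.
            int bucket = Integer.parseInt(split.substring(split.lastIndexOf('-') + 1));
            perTask.get(bucket % numPartitions).add(split);
        }
        return perTask;
    }

    public static void main(String[] args) {
        List<String> splits = List.of("part-0", "part-1", "part-2", "part-3");
        // Buckets 0,2 -> task 0; buckets 1,3 -> task 1.
        System.out.println(assignSplits(splits, 2));
        // [[part-0, part-2], [part-1, part-3]]
    }
}
```

The point of the sketch is only that the split-to-task mapping is a pure function of the input, which is the determinism property Hitesh describes above; the real vertex manager would additionally react to events (source task completions, root input initialization) via the VertexManagerPlugin callbacks.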
