>> even for the same job and same total input size, the number of tasks and the input size per task will differ for consecutive runs

Correct, but for main inputs (HDFS inputs) you can choose not to use that.

>> the number of tasks for a vertex can change at runtime (not duplicates, but distinct tasks)

Yes, depending on the data stats, the number of reducers can change per job. Again, only if that optimization is turned on.

From: Grandl Robert [mailto:[email protected]]
Sent: Wednesday, July 09, 2014 10:03 AM
To: [email protected]; Hitesh Shah
Subject: Re: get input size for each task

Thanks a lot for your detailed answer.

When you say "it is now decided based on a combination of available cluster resources as well as the input data": what do you mean by available cluster resources? The total resources in the cluster (number of nodes * capability of each node), or the instantaneous available resources based on the current workload on each node?

So to be clear: even for the same job and same total input size, the number of tasks and the input size per task will differ for consecutive runs?

Also, it seems the number of tasks for a vertex can change at runtime (not duplicates, but distinct tasks)?

thanks,
Robert

On Wednesday, July 9, 2014 7:48 AM, Hitesh Shah <[email protected]> wrote:

For root vertices (ones which read from HDFS), the no. of tasks can be decided based on the size of the input data, though in most cases it is now decided based on a combination of available cluster resources as well as the input data. There are some special cases in Hive where a vertex can have an HDFS input as well as an edge from another vertex, where the decision making is non-trivial. For intermediate vertices, the no. of tasks is determined either by the user or at runtime based on the size of the inbound data.

In any case, the input data is understood by the InputFormat, and that specialized class decides how to "split" the data set. Furthermore, based on additional grouping logic, multiple splits can be combined together, based on their data's location, to define a single unit of work for a given task.
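To make the grouping idea concrete, here is a rough sketch of how per-block splits might be packed into task-sized groups. This is an illustrative toy, not Tez's actual grouping code; all names and the greedy packing strategy are hypothetical, and real grouping also considers data locality.

```python
# Illustrative sketch only (not Tez's actual split grouper): combine
# per-block splits into task-sized groups, with the target group size
# clamped to a configured min/max byte range. All names are hypothetical.

def group_splits(split_sizes, desired_tasks, min_group_bytes, max_group_bytes):
    """Greedily pack block-sized splits into groups of roughly equal size."""
    total = sum(split_sizes)
    # Target bytes per task, clamped to the [min, max] range.
    target = max(min_group_bytes,
                 min(max_group_bytes, total // max(desired_tasks, 1)))
    groups, current, current_size = [], [], 0
    for size in split_sizes:
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Ten 128 MB blocks, requesting ~4 tasks: grouping produces 5 tasks of
# 256 MB each, illustrating that the final task count comes out of the
# grouping logic rather than matching the requested parallelism exactly.
mb = 1024 * 1024
groups = group_splits([128 * mb] * 10, desired_tasks=4,
                      min_group_bytes=64 * mb, max_group_bytes=1024 * mb)
print([sum(g) // mb for g in groups])  # → [256, 256, 256, 256, 256]
```

With a different block layout or min/max range, the groups need not be equal in size, which is the "no guarantees" point Hitesh makes below.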
This grouping is defined based on a min/max data size range and the available cluster resources. All in all, there are no guarantees that tasks will have an equal input size. Equal-sized splits are the common case for FileInputFormat, which splits the data (one split per block), as long as grouping is not enabled.

thanks
— Hitesh

On Jul 8, 2014, at 8:29 PM, Grandl Robert <[email protected]> wrote:

> Hitesh,
>
> With respect to the below comment: so a vertex will have a number of tasks, which is decided strictly based on the input data the vertex has to process? Also, is it guaranteed that every task will have the same input size (all except the last one, probably)?
>
> Thanks,
> Robert
>
> > Correct. The hierarchy is dag -> vertex -> task -> task attempt (each relationship being 1:N).
> > Vertex defines a stage of common processing logic applied on a parallel data set. A task represents processing of a subset of the data set.
>
> On Monday, July 7, 2014 10:37 AM, Hitesh Shah <[email protected]> wrote:
>
> Correct. The hierarchy is dag -> vertex -> task -> task attempt (each relationship being 1:N).
> Vertex defines a stage of common processing logic applied on a parallel data set. A task represents processing of a subset of the data set.
>
> thanks
> — Hitesh
>
> On Jul 7, 2014, at 9:40 AM, Grandl Robert <[email protected]> wrote:
>
> > Another dumb question: a vertex can have multiple tasks (not task attempts), for different input blocks, right? So a vertex entity is kind of a stage abstraction, not a task abstraction, right?
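The dag -> vertex -> task -> task attempt hierarchy described in the thread (each relationship 1:N) can be sketched as plain data structures. This is a hypothetical illustration of the containment relationships, not Tez's real API or class names.

```python
# Hypothetical sketch of the 1:N containment hierarchy from the thread:
# dag -> vertex -> task -> task attempt. Not Tez's actual classes.
from dataclasses import dataclass, field

@dataclass
class TaskAttempt:
    attempt_id: int  # retries of the same task become new attempts

@dataclass
class Task:
    task_id: int  # processes one subset of the data (e.g. one grouped split)
    attempts: list = field(default_factory=list)

@dataclass
class Vertex:
    name: str  # a stage of common processing logic over a parallel data set
    tasks: list = field(default_factory=list)

@dataclass
class DAG:
    vertices: list = field(default_factory=list)

# A vertex is a stage abstraction, not a task abstraction: one vertex,
# many distinct tasks, each with its own attempt(s).
v = Vertex("Map 1", tasks=[Task(i, [TaskAttempt(0)]) for i in range(3)])
dag = DAG([v])
print(len(dag.vertices[0].tasks))  # → 3
```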
