>> even for the same job and same total input size, the number of tasks and the input size per task will differ for consecutive runs

Correct, but for main inputs (HDFS inputs) you can choose not to use that.

>> the number of tasks for a vertex can change at runtime (not duplicates, but distinct tasks)

Yes, depending on the data stats, the number of reducers can change per job. Again, only if that optimization is turned on.

From: Grandl Robert [mailto:[email protected]]
Sent: Wednesday, July 09, 2014 10:03 AM
To: [email protected]; Hitesh Shah
Subject: Re: get input size for each task

Thanks a lot for your detailed answer.

When you say "it is now decided based on a combination of available cluster resources as well as the input data": what do you mean by available cluster resources? The total resources in the cluster (number of nodes * capability of each node), or the instantaneous available resources based on the current workload on each node?

So to be clear: even for the same job and same total input size, the number of tasks and the input size per task will differ for consecutive runs?

Also, it seems the number of tasks for a vertex can change at runtime (not duplicates, but distinct tasks)?

thanks,
Robert

On Wednesday, July 9, 2014 7:48 AM, Hitesh Shah <[email protected]> wrote:

For root vertices (ones which read from HDFS), the no. of tasks can be decided based on the size of the input data, though in most cases it is now decided based on a combination of available cluster resources as well as the input data. There are some special cases in Hive where a vertex can have an HDFS input as well as an edge from another vertex, where the decision making is non-trivial. For intermediate vertices, the no. of tasks is determined either by the user or at runtime based on the size of the inbound data.

In any case, the input data is understood by the InputFormat, and that specialized class decides how to "split" the data set. Furthermore, based on additional grouping logic, multiple splits can be combined together, based on their data's location, to define a single unit of work for a given task.
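To make the grouping idea concrete, here is a rough sketch of how per-block splits might be packed into task-sized groups. This is an illustrative toy, not Tez's actual grouping code; all names and the greedy packing strategy are hypothetical, and real grouping also considers data locality.

```python
# Illustrative sketch only (not Tez's actual split grouper): combine
# per-block splits into task-sized groups, with the target group size
# clamped to a configured min/max byte range. All names are hypothetical.

def group_splits(split_sizes, desired_tasks, min_group_bytes, max_group_bytes):
    """Greedily pack block-sized splits into groups of roughly equal size."""
    total = sum(split_sizes)
    # Target bytes per task, clamped to the [min, max] range.
    target = max(min_group_bytes,
                 min(max_group_bytes, total // max(desired_tasks, 1)))
    groups, current, current_size = [], [], 0
    for size in split_sizes:
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Ten 128 MB blocks, requesting ~4 tasks: grouping produces 5 tasks of
# 256 MB each, illustrating that the final task count comes out of the
# grouping logic rather than matching the requested parallelism exactly.
mb = 1024 * 1024
groups = group_splits([128 * mb] * 10, desired_tasks=4,
                      min_group_bytes=64 * mb, max_group_bytes=1024 * mb)
print([sum(g) // mb for g in groups])  # → [256, 256, 256, 256, 256]
```

With a different block layout or min/max range, the groups need not be equal in size, which is the "no guarantees" point Hitesh makes below.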
This grouping is defined based on a min/max data size range and the available cluster resources. All in all, there are no guarantees that tasks will have an equal input size. Equal-sized splits are the common case for FileInputFormat, which splits the data (one split per block), as long as grouping is not enabled.

thanks
— Hitesh

On Jul 8, 2014, at 8:29 PM, Grandl Robert <[email protected]> wrote:

> Hitesh,
>
> With respect to the below comment: so a vertex will have a number of tasks, which is decided strictly based on the input data the vertex has to process? Also, is it guaranteed that every task will have the same input size (all except the last one, probably)?
>
> Thanks,
> Robert
>
> > Correct. The hierarchy is dag -> vertex -> task -> task attempt (each relationship being 1:N).
> > Vertex defines a stage of common processing logic applied on a parallel data set. A task represents processing of a subset of the data set.
>
> On Monday, July 7, 2014 10:37 AM, Hitesh Shah <[email protected]> wrote:
>
> Correct. The hierarchy is dag -> vertex -> task -> task attempt (each relationship being 1:N).
> Vertex defines a stage of common processing logic applied on a parallel data set. A task represents processing of a subset of the data set.
>
> thanks
> — Hitesh
>
> On Jul 7, 2014, at 9:40 AM, Grandl Robert <[email protected]> wrote:
>
> > Another dumb question: a vertex can have multiple tasks (not task attempts), for different input blocks, right? So a vertex entity is kind of a stage abstraction, not a task abstraction, right?
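The dag -> vertex -> task -> task attempt hierarchy described in the thread (each relationship 1:N) can be sketched as plain data structures. This is a hypothetical illustration of the containment relationships, not Tez's real API or class names.

```python
# Hypothetical sketch of the 1:N containment hierarchy from the thread:
# dag -> vertex -> task -> task attempt. Not Tez's actual classes.
from dataclasses import dataclass, field

@dataclass
class TaskAttempt:
    attempt_id: int  # retries of the same task become new attempts

@dataclass
class Task:
    task_id: int  # processes one subset of the data (e.g. one grouped split)
    attempts: list = field(default_factory=list)

@dataclass
class Vertex:
    name: str  # a stage of common processing logic over a parallel data set
    tasks: list = field(default_factory=list)

@dataclass
class DAG:
    vertices: list = field(default_factory=list)

# A vertex is a stage abstraction, not a task abstraction: one vertex,
# many distinct tasks, each with its own attempt(s).
v = Vertex("Map 1", tasks=[Task(i, [TaskAttempt(0)]) for i in range(3)])
dag = DAG([v])
print(len(dag.vertices[0].tasks))  # → 3
```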
