OK, thanks for the clarification, Till!

On Thu, Mar 31, 2016 at 2:14 PM, Till Rohrmann <[email protected]> wrote:

> A partition is the portion of data each task receives. Thus, the degree of
> parallelism of your program/task decides how many different partitions you
> have. Depending on the upstream operators (and which data is sent to which
> task), the partitions will most likely differ in size.
>
> Cheers,
> Till
>
> On Thu, Mar 31, 2016 at 2:11 PM, Flavio Pompermaier <[email protected]>
> wrote:
>
>> Hi Till and Tarandeep,
>> I'd also like to better understand the concept of a partition.
>> From what I know, a partition is the portion of data assigned by the job
>> manager to each task manager, right?
>> Then each partition is divided again at the task manager to maximize
>> slot usage. Is that correct?
>> In any case, there will be situations where at least one partition is
>> smaller than the others. Am I wrong? Am I confusing some terms?
>>
>> Best,
>> Flavio
>>
>>
>> On Thu, Mar 31, 2016 at 1:56 PM, Till Rohrmann <[email protected]>
>> wrote:
>>
>>> Hi Tarandeep,
>>>
>>> the number of elements in each partition should stay constant between
>>> the two mapPartition calls. In fact, the elements in each partition
>>> should not change.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Wed, Mar 30, 2016 at 8:14 AM, Tarandeep Singh <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am looking at the implementation of zipWithIndex in DataSetUtils:
>>>>
>>>> https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java
>>>>
>>>> It works in two phases:
>>>> 1) Count the number of elements in each partition (using mapPartition).
>>>> 2) In a second mapPartition, assign a unique ID to each element by
>>>> calculating an offset from the counts computed in step 1.
>>>>
>>>> Is there any chance the second mapPartition won't get the same number of
>>>> elements as the first mapPartition (assuming the data is in HDFS)?
>>>>
>>>> Thanks
>>>> Tarandeep
>>>>
>>>
>>>
>>
>
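
For reference, the two phases described in the original question can be
sketched in plain Java, with no Flink dependency. This is only a rough
simulation, not the actual DataSetUtils code: partitions are modelled as
in-memory lists, and the class and method names (ZipWithIndexSketch,
countPerPartition, offsets, zipWithIndex) are made up for illustration.
It shows why the second pass must see the same elements as the first: the
offsets are derived from the counts, so a changed count would produce
overlapping or skipped IDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of a two-phase zipWithIndex; no Flink involved.
// All names here are hypothetical and exist only for this sketch.
final class ZipWithIndexSketch {

    // Phase 1: count the elements in each partition.
    static long[] countPerPartition(List<List<String>> partitions) {
        long[] counts = new long[partitions.size()];
        for (int i = 0; i < partitions.size(); i++) {
            counts[i] = partitions.get(i).size();
        }
        return counts;
    }

    // Turn per-partition counts into start offsets (exclusive prefix sum).
    static long[] offsets(long[] counts) {
        long[] offsets = new long[counts.length];
        long running = 0;
        for (int i = 0; i < counts.length; i++) {
            offsets[i] = running;
            running += counts[i];
        }
        return offsets;
    }

    // Phase 2: walk each partition again; ID = partition offset + local position.
    static List<String> zipWithIndex(List<List<String>> partitions) {
        long[] offsets = offsets(countPerPartition(partitions));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < partitions.size(); i++) {
            long index = offsets[i];
            for (String element : partitions.get(i)) {
                result.add(index + ":" + element);
                index++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Three partitions of different sizes, as in the discussion above.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Arrays.asList("d", "e", "f"));
        System.out.println(zipWithIndex(partitions));
    }
}
```

With partition sizes 2, 1 and 3, the offsets are 0, 2 and 3, so the six
elements receive the IDs 0 through 5 without gaps or duplicates.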
