Ok, thanks for the clarification, Till!

On Thu, Mar 31, 2016 at 2:14 PM, Till Rohrmann <[email protected]> wrote:
> A partition is the portion of data each task receives. Thus, the degree of
> parallelism of your program/task decides how many different partitions you
> have. Depending on the upstream operators (and which data is sent to which
> task), the partitions will most likely differ in size.
>
> Cheers,
> Till
>
> On Thu, Mar 31, 2016 at 2:11 PM, Flavio Pompermaier <[email protected]>
> wrote:
>
>> Hi Till and Tarandeep,
>> I'm also interested in better understanding the concept of a
>> partition.
>> From what I know, a partition is the portion of data assigned by the job
>> manager to each task manager, right?
>> Then, each partition is divided again at the task manager to maximize the
>> slot usage. Is that correct?
>> In any case, there will be a case where at least one partition is
>> smaller than the others. Am I wrong? Am I confusing some terms?
>>
>> Best,
>> Flavio
>>
>>
>> On Thu, Mar 31, 2016 at 1:56 PM, Till Rohrmann <[email protected]>
>> wrote:
>>
>>> Hi Tarandeep,
>>>
>>> the number of elements in each partition should stay constant. In fact
>>> the elements in each partition should not change.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Wed, Mar 30, 2016 at 8:14 AM, Tarandeep Singh <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am looking at the implementation of zipWithIndex in DataSetUtils:
>>>>
>>>> https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java
>>>>
>>>> It works in two phases/steps:
>>>> 1) Count the number of elements in each partition (using mapPartition).
>>>> 2) In a second mapPartition, a unique ID is assigned by calculating an
>>>> offset from the number of elements computed in step 1.
>>>>
>>>> Is there any chance the second mapPartition won't get the same number of
>>>> elements as the first mapPartition (assuming the data is in HDFS)?
>>>>
>>>> Thanks
>>>> Tarandeep
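
For anyone finding this thread later, here is a rough sketch of the two-phase idea Tarandeep describes, written as plain Java over in-memory lists rather than against the Flink API. It is not the code from DataSetUtils. The class name ZipWithIndexSketch, both method names, and the "index:element" output format are made up for illustration. Each inner list stands in for one partition, and the sketch assumes what Till confirmed above: both passes see the same elements in each partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of the two-phase scheme; NOT Flink's DataSetUtils code.
// All names here are hypothetical and chosen only for this illustration.
final class ZipWithIndexSketch {

    // Phase 1: count the elements in each partition.
    static long[] countPerPartition(List<List<String>> partitions) {
        long[] counts = new long[partitions.size()];
        for (int i = 0; i < partitions.size(); i++) {
            counts[i] = partitions.get(i).size();
        }
        return counts;
    }

    // Phase 2: partition i starts at the sum of the counts of partitions 0..i-1,
    // so IDs are unique as long as phase 2 sees the same elements as phase 1.
    static List<String> zipWithIndex(List<List<String>> partitions) {
        long[] counts = countPerPartition(partitions);
        List<String> out = new ArrayList<>();
        long offset = 0;
        for (int i = 0; i < partitions.size(); i++) {
            long index = offset;
            for (String element : partitions.get(i)) {
                out.add(index + ":" + element);
                index++;
            }
            offset += counts[i];
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("a", "b", "c"),
            Arrays.asList("d"),
            Arrays.asList("e", "f"));
        System.out.println(zipWithIndex(partitions));
    }
}
```

If the second pass saw a different number of elements than the first pass counted, the offsets would no longer line up and IDs could collide or leave gaps, which is why the constant partition contents matter here.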
