Re: How does setMaxParallelism work

2018-03-30 Thread Nico Kruber
No, currently, this it is up to you to decide whether you need to scale
and how. If, for a running Flink job, you decide to scale, you
- flink cancel --withSavepoint  
- flink run -p  --fromSavepoint 
 


Nico

On 29/03/18 19:27, NEKRASSOV, ALEXEI wrote:
> Is there an auto-scaling feature in Flink, where I start with parallelism of 
> (for example) 1, but Flink notices I have high volume of data to process, and 
> automatically increases parallelism of a running job?
> 
> Thanks,
> Alex
> 
> -Original Message-
> From: Nico Kruber [mailto:n...@data-artisans.com] 
> Sent: Wednesday, March 28, 2018 8:54 AM
> To: Data Engineer <dataenginee...@gmail.com>
> Cc: Jörn Franke <jornfra...@gmail.com>; user@flink.apache.org
> Subject: Re: How does setMaxParallelism work
> 
> Flink does not decide the parallelism based on your job.
> There is a default parallelism (configured via parallelism.default [1], by 
> default 1) which is used if you do not specify it yourself.
> 
> 
> Nico
> 
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/config.html#common-options
> 
> On 28/03/18 13:21, Data Engineer wrote:
>> Agreed. But how did Flink decide that it should allot 1 subtask? Why 
>> not
>> 2 or 3?
>> I am trying to understand the implications of using setMaxParallelism 
>> vs setParallelism
>>
>> On Wed, Mar 28, 2018 at 2:58 PM, Nico Kruber <n...@data-artisans.com 
>> <mailto:n...@data-artisans.com>> wrote:
>>
>> Hi James,
>> the number of subtasks being used is defined by the parallelism, the max
>> parallelism, however, "... determines the maximum parallelism to which
>> you can scale operators" [1]. That is, once set, you cannot ever (even
>> after restarting your program from a savepoint) increase the operator's
>> parallelism above this value. The actual parallelism can be set per job
>> in your program but also in the flink client:
>> flink run -p   
>>
>>
>> Nico
>>
>>
>>
>> [1]
>> 
>> https://ci.apache.org/projects/flink/flink-docs-master/ops/production_ready.html#set-maximum-parallelism-for-operators-explicitly
>> 
>> <https://ci.apache.org/projects/flink/flink-docs-master/ops/production
>> _ready.html#set-maximum-parallelism-for-operators-explicitly>
>>
>> On 28/03/18 09:25, Data Engineer wrote:
>> > I have a sample application that reads around 2 GB of csv files,
>> > converts each record into Avro object and sends it to kafka.
>> > I use a custom FileReader that reads the files in a directory.
>> > I have set taskmanager.numberOfTaskSlots to 4.
>> > I see that if I use setParallelism(3), 3 subtasks are created. But if I
>> > use setMaxParallelism(3), only 1 subtask is created.
>> >
>> > On Wed, Mar 28, 2018 at 12:29 PM, Jörn Franke <jornfra...@gmail.com 
>> <mailto:jornfra...@gmail.com>
>> > <mailto:jornfra...@gmail.com <mailto:jornfra...@gmail.com>>> wrote:
>> >
>> >     What was the input format, the size and the program that you tried
>> >     to execute
>> >
>> >     On 28. Mar 2018, at 08:18, Data Engineer <dataenginee...@gmail.com 
>> <mailto:dataenginee...@gmail.com>
>> >     <mailto:dataenginee...@gmail.com 
>> <mailto:dataenginee...@gmail.com>>> wrote:
>> >
>> >>     I went through the explanation on MaxParallelism in the official
>> >>     docs here:
>> >>     
>> https://ci.apache.org/projects/flink/flink-docs-master/ops/production_ready.html#set-maximum-parallelism-for-operators-explicitly
>> 
>> <https://ci.apache.org/projects/flink/flink-docs-master/ops/production_ready.html#set-maximum-parallelism-for-operators-explicitly>
>> >>     
>> <https://ci.apache.org/projects/flink/flink-docs-master/ops/production_ready.html#set-maximum-parallelism-for-operators-explicitly
>> 
>> <https://ci.apache.org/projects/flink/flink-docs-master/ops/production_ready.html#set-maximum-parallelism-for-operators-explicitly>>
>> >>
>> >>     However, I am not able to figure out how Flink decides the
>> >>     parallelism value.
>> >>     For instance, if I setMaxParallelism to 3, I see that for my job,
>> >>     there is only 1 subtask that is created. How did Flink decide that
>> >>     1 subtask was enough?
>> >>
>> >>

RE: How does setMaxParallelism work

2018-03-29 Thread NEKRASSOV, ALEXEI
Is there an auto-scaling feature in Flink, where I start with parallelism of 
(for example) 1, but Flink notices I have high volume of data to process, and 
automatically increases parallelism of a running job?

Thanks,
Alex

-Original Message-
From: Nico Kruber [mailto:n...@data-artisans.com] 
Sent: Wednesday, March 28, 2018 8:54 AM
To: Data Engineer <dataenginee...@gmail.com>
Cc: Jörn Franke <jornfra...@gmail.com>; user@flink.apache.org
Subject: Re: How does setMaxParallelism work

Flink does not decide the parallelism based on your job.
There is a default parallelism (configured via parallelism.default [1], by 
default 1) which is used if you do not specify it yourself.


Nico

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/config.html#common-options

On 28/03/18 13:21, Data Engineer wrote:
> Agreed. But how did Flink decide that it should allot 1 subtask? Why 
> not
> 2 or 3?
> I am trying to understand the implications of using setMaxParallelism 
> vs setParallelism
> 
> On Wed, Mar 28, 2018 at 2:58 PM, Nico Kruber <n...@data-artisans.com 
> <mailto:n...@data-artisans.com>> wrote:
> 
> Hi James,
> the number of subtasks being used is defined by the parallelism, the max
> parallelism, however, "... determines the maximum parallelism to which
> you can scale operators" [1]. That is, once set, you cannot ever (even
> after restarting your program from a savepoint) increase the operator's
> parallelism above this value. The actual parallelism can be set per job
> in your program but also in the flink client:
> flink run -p   
> 
> 
> Nico
> 
> 
> 
> [1]
> 
> https://ci.apache.org/projects/flink/flink-docs-master/ops/production_ready.html#set-maximum-parallelism-for-operators-explicitly
> 
> <https://ci.apache.org/projects/flink/flink-docs-master/ops/production
> _ready.html#set-maximum-parallelism-for-operators-explicitly>
> 
> On 28/03/18 09:25, Data Engineer wrote:
> > I have a sample application that reads around 2 GB of csv files,
> > converts each record into Avro object and sends it to kafka.
> > I use a custom FileReader that reads the files in a directory.
> > I have set taskmanager.numberOfTaskSlots to 4.
> > I see that if I use setParallelism(3), 3 subtasks are created. But if I
> > use setMaxParallelism(3), only 1 subtask is created.
> >
> > On Wed, Mar 28, 2018 at 12:29 PM, Jörn Franke <jornfra...@gmail.com 
> <mailto:jornfra...@gmail.com>
> > <mailto:jornfra...@gmail.com <mailto:jornfra...@gmail.com>>> wrote:
> >
> >     What was the input format, the size and the program that you tried
> >     to execute
> >
> >     On 28. Mar 2018, at 08:18, Data Engineer <dataenginee...@gmail.com 
> <mailto:dataenginee...@gmail.com>
> >     <mailto:dataenginee...@gmail.com 
> <mailto:dataenginee...@gmail.com>>> wrote:
> >
> >>     I went through the explanation on MaxParallelism in the official
> >>     docs here:
> >>     
> https://ci.apache.org/projects/flink/flink-docs-master/ops/production_ready.html#set-maximum-parallelism-for-operators-explicitly
> 
> <https://ci.apache.org/projects/flink/flink-docs-master/ops/production_ready.html#set-maximum-parallelism-for-operators-explicitly>
> >>     
> <https://ci.apache.org/projects/flink/flink-docs-master/ops/production_ready.html#set-maximum-parallelism-for-operators-explicitly
> 
> <https://ci.apache.org/projects/flink/flink-docs-master/ops/production_ready.html#set-maximum-parallelism-for-operators-explicitly>>
> >>
> >>     However, I am not able to figure out how Flink decides the
> >>     parallelism value.
> >>     For instance, if I setMaxParallelism to 3, I see that for my job,
> >>     there is only 1 subtask that is created. How did Flink decide that
> >>     1 subtask was enough?
> >>
> >>     Regards,
> >>     James
> >
> >
> 
> --
> Nico Kruber | Software Engineer
> data Artisans
> 
> Follow us @dataArtisans
> --
> Join Flink Forward - The Apache Flink Conference
> Stream Processing | Event Driven | Real Time
> --
> Data Artisans GmbH | Stresemannstr. 121A,10963 Berlin, Germany
> data Artisans, Inc. | 1161 Mission Street, San Francisco, CA-94103, USA
> --
> Data Artisans GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
> 
> 

--
Nico Kruber | Software Engineer
data Artisans

Follow us @dataArtisans
--
Join Flink Forward - The Apache Flink Conference Stream Processing | Event 
Driven | Real Time
--
Data Artisans GmbH | Stresemannstr. 121A,10963 Berlin, Germany data Artisans, 
Inc. | 1161 Mission Street, San Francisco, CA-94103, USA
--
Data Artisans GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B Managing Directors: Dr. 
Kostas Tzoumas, Dr. Stephan Ewen



Re: How does setMaxParallelism work

2018-03-28 Thread Nico Kruber
Flink does not decide the parallelism based on your job.
There is a default parallelism (configured via parallelism.default [1],
by default 1) which is used if you do not specify it yourself.


Nico

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/config.html#common-options

On 28/03/18 13:21, Data Engineer wrote:
> Agreed. But how did Flink decide that it should allot 1 subtask? Why not
> 2 or 3?
> I am trying to understand the implications of using setMaxParallelism vs
> setParallelism
> 
> On Wed, Mar 28, 2018 at 2:58 PM, Nico Kruber  > wrote:
> 
> Hi James,
> the number of subtasks being used is defined by the parallelism, the max
> parallelism, however, "... determines the maximum parallelism to which
> you can scale operators" [1]. That is, once set, you cannot ever (even
> after restarting your program from a savepoint) increase the operator's
> parallelism above this value. The actual parallelism can be set per job
> in your program but also in the flink client:
> flink run -p   
> 
> 
> Nico
> 
> 
> 
> [1]
> 
> https://ci.apache.org/projects/flink/flink-docs-master/ops/production_ready.html#set-maximum-parallelism-for-operators-explicitly
> 
> 
> 
> On 28/03/18 09:25, Data Engineer wrote:
> > I have a sample application that reads around 2 GB of csv files,
> > converts each record into Avro object and sends it to kafka.
> > I use a custom FileReader that reads the files in a directory.
> > I have set taskmanager.numberOfTaskSlots to 4.
> > I see that if I use setParallelism(3), 3 subtasks are created. But if I
> > use setMaxParallelism(3), only 1 subtask is created.
> >
> > On Wed, Mar 28, 2018 at 12:29 PM, Jörn Franke  
> > >> wrote:
> >
> >     What was the input format, the size and the program that you tried
> >     to execute
> >
> >     On 28. Mar 2018, at 08:18, Data Engineer  
> >      >> wrote:
> >
> >>     I went through the explanation on MaxParallelism in the official
> >>     docs here:
> >>     
> https://ci.apache.org/projects/flink/flink-docs-master/ops/production_ready.html#set-maximum-parallelism-for-operators-explicitly
> 
> 
> >>     
>  
> >
> >>
> >>     However, I am not able to figure out how Flink decides the
> >>     parallelism value.
> >>     For instance, if I setMaxParallelism to 3, I see that for my job,
> >>     there is only 1 subtask that is created. How did Flink decide that
> >>     1 subtask was enough?
> >>
> >>     Regards,
> >>     James
> >
> >
> 
> --
> Nico Kruber | Software Engineer
> data Artisans
> 
> Follow us @dataArtisans
> --
> Join Flink Forward - The Apache Flink Conference
> Stream Processing | Event Driven | Real Time
> --
> Data Artisans GmbH | Stresemannstr. 121A,10963 Berlin, Germany
> data Artisans, Inc. | 1161 Mission Street, San Francisco, CA-94103, USA
> --
> Data Artisans GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
> 
> 

-- 
Nico Kruber | Software Engineer
data Artisans

Follow us @dataArtisans
--
Join Flink Forward - The Apache Flink Conference
Stream Processing | Event Driven | Real Time
--
Data Artisans GmbH | Stresemannstr. 121A,10963 Berlin, Germany
data Artisans, Inc. | 1161 Mission Street, San Francisco, CA-94103, USA
--
Data Artisans GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen



signature.asc
Description: OpenPGP digital signature


Re: How does setMaxParallelism work

2018-03-28 Thread Data Engineer
Agreed. But how did Flink decide that it should allot 1 subtask? Why not 2
or 3?
I am trying to understand the implications of using setMaxParallelism vs
setParallelism

On Wed, Mar 28, 2018 at 2:58 PM, Nico Kruber  wrote:

> Hi James,
> the number of subtasks being used is defined by the parallelism, the max
> parallelism, however, "... determines the maximum parallelism to which
> you can scale operators" [1]. That is, once set, you cannot ever (even
> after restarting your program from a savepoint) increase the operator's
> parallelism above this value. The actual parallelism can be set per job
> in your program but also in the flink client:
> flink run -p   
>
>
> Nico
>
>
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-
> master/ops/production_ready.html#set-maximum-parallelism-
> for-operators-explicitly
>
> On 28/03/18 09:25, Data Engineer wrote:
> > I have a sample application that reads around 2 GB of csv files,
> > converts each record into Avro object and sends it to kafka.
> > I use a custom FileReader that reads the files in a directory.
> > I have set taskmanager.numberOfTaskSlots to 4.
> > I see that if I use setParallelism(3), 3 subtasks are created. But if I
> > use setMaxParallelism(3), only 1 subtask is created.
> >
> > On Wed, Mar 28, 2018 at 12:29 PM, Jörn Franke  > > wrote:
> >
> > What was the input format, the size and the program that you tried
> > to execute
> >
> > On 28. Mar 2018, at 08:18, Data Engineer  > > wrote:
> >
> >> I went through the explanation on MaxParallelism in the official
> >> docs here:
> >> https://ci.apache.org/projects/flink/flink-docs-
> master/ops/production_ready.html#set-maximum-parallelism-
> for-operators-explicitly
> >>  master/ops/production_ready.html#set-maximum-parallelism-
> for-operators-explicitly>
> >>
> >> However, I am not able to figure out how Flink decides the
> >> parallelism value.
> >> For instance, if I setMaxParallelism to 3, I see that for my job,
> >> there is only 1 subtask that is created. How did Flink decide that
> >> 1 subtask was enough?
> >>
> >> Regards,
> >> James
> >
> >
>
> --
> Nico Kruber | Software Engineer
> data Artisans
>
> Follow us @dataArtisans
> --
> Join Flink Forward - The Apache Flink Conference
> Stream Processing | Event Driven | Real Time
> --
> Data Artisans GmbH | Stresemannstr. 121A,10963 Berlin, Germany
> data Artisans, Inc. | 1161 Mission Street, San Francisco, CA-94103, USA
> --
> Data Artisans GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
>
>


Re: How does setMaxParallelism work

2018-03-28 Thread Nico Kruber
Hi James,
the number of subtasks being used is defined by the parallelism, the max
parallelism, however, "... determines the maximum parallelism to which
you can scale operators" [1]. That is, once set, you cannot ever (even
after restarting your program from a savepoint) increase the operator's
parallelism above this value. The actual parallelism can be set per job
in your program but also in the flink client:
flink run -p   


Nico



[1]
https://ci.apache.org/projects/flink/flink-docs-master/ops/production_ready.html#set-maximum-parallelism-for-operators-explicitly

On 28/03/18 09:25, Data Engineer wrote:
> I have a sample application that reads around 2 GB of csv files,
> converts each record into Avro object and sends it to kafka.
> I use a custom FileReader that reads the files in a directory.
> I have set taskmanager.numberOfTaskSlots to 4.
> I see that if I use setParallelism(3), 3 subtasks are created. But if I
> use setMaxParallelism(3), only 1 subtask is created.
> 
> On Wed, Mar 28, 2018 at 12:29 PM, Jörn Franke  > wrote:
> 
> What was the input format, the size and the program that you tried
> to execute
> 
> On 28. Mar 2018, at 08:18, Data Engineer  > wrote:
> 
>> I went through the explanation on MaxParallelism in the official
>> docs here:
>> 
>> https://ci.apache.org/projects/flink/flink-docs-master/ops/production_ready.html#set-maximum-parallelism-for-operators-explicitly
>> 
>> 
>>
>> However, I am not able to figure out how Flink decides the
>> parallelism value.
>> For instance, if I setMaxParallelism to 3, I see that for my job,
>> there is only 1 subtask that is created. How did Flink decide that
>> 1 subtask was enough?
>>
>> Regards,
>> James
> 
> 

-- 
Nico Kruber | Software Engineer
data Artisans

Follow us @dataArtisans
--
Join Flink Forward - The Apache Flink Conference
Stream Processing | Event Driven | Real Time
--
Data Artisans GmbH | Stresemannstr. 121A,10963 Berlin, Germany
data Artisans, Inc. | 1161 Mission Street, San Francisco, CA-94103, USA
--
Data Artisans GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen



signature.asc
Description: OpenPGP digital signature


Re: How does setMaxParallelism work

2018-03-28 Thread Data Engineer
I have a sample application that reads around 2 GB of csv files, converts
each record into Avro object and sends it to kafka.
I use a custom FileReader that reads the files in a directory.
I have set taskmanager.numberOfTaskSlots to 4.
I see that if I use setParallelism(3), 3 subtasks are created. But if I use
setMaxParallelism(3), only 1 subtask is created.

On Wed, Mar 28, 2018 at 12:29 PM, Jörn Franke  wrote:

> What was the input format, the size and the program that you tried to
> execute
>
> On 28. Mar 2018, at 08:18, Data Engineer  wrote:
>
> I went through the explanation on MaxParallelism in the official docs here:
> https://ci.apache.org/projects/flink/flink-docs-
> master/ops/production_ready.html#set-maximum-parallelism-
> for-operators-explicitly
>
> However, I am not able to figure out how Flink decides the parallelism
> value.
> For instance, if I setMaxParallelism to 3, I see that for my job, there is
> only 1 subtask that is created. How did Flink decide that 1 subtask was
> enough?
>
> Regards,
> James
>
>


Re: How does setMaxParallelism work

2018-03-28 Thread Jörn Franke
What was the input format, the size and the program that you tried to execute

> On 28. Mar 2018, at 08:18, Data Engineer  wrote:
> 
> I went through the explanation on MaxParallelism in the official docs here:
> https://ci.apache.org/projects/flink/flink-docs-master/ops/production_ready.html#set-maximum-parallelism-for-operators-explicitly
> 
> However, I am not able to figure out how Flink decides the parallelism value.
> For instance, if I setMaxParallelism to 3, I see that for my job, there is 
> only 1 subtask that is created. How did Flink decide that 1 subtask was 
> enough?
> 
> Regards,
> James


How does setMaxParallelism work

2018-03-28 Thread Data Engineer
I went through the explanation on MaxParallelism in the official docs here:
https://ci.apache.org/projects/flink/flink-docs-master/ops/production_ready.html#set-maximum-parallelism-for-operators-explicitly

However, I am not able to figure out how Flink decides the parallelism
value.
For instance, if I setMaxParallelism to 3, I see that for my job, there is
only 1 subtask that is created. How did Flink decide that 1 subtask was
enough?

Regards,
James