subject:"\[GitHub\] \[flink\] knaufk commented on a change in pull request #11092\: \[FLINK\-15999\] Extract “Concepts” material from API\/Library sections and start proper concepts section"

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-15 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379550966
 
 

 ##
 File path: docs/concepts/stream-processing.md
 ##
 @@ -0,0 +1,96 @@
+---
+title: Stream Processing
+nav-id: stream-processing
+nav-pos: 1
+nav-title: Stream Processing
+nav-parent_id: concepts
+---
+
+
+`TODO: Add introduction`
+* This will be replaced by the TOC
+{:toc}
+
+## A Unified System for Batch & Stream Processing
+
+`TODO`
+
+{% top %}
+
+## Programs and Dataflows
+
+The basic building blocks of Flink programs are **streams** and
+**transformations**. Conceptually a *stream* is a (potentially never-ending)
+flow of data records, and a *transformation* is an operation that takes one or
+more streams as input, and produces one or more output streams as a result.
+
+When executed, Flink programs are mapped to **streaming dataflows**, consisting
+of **streams** and transformation **operators**. Each dataflow starts with one
+or more **sources** and ends in one or more **sinks**. The dataflows resemble
+arbitrary **directed acyclic graphs** *(DAGs)*. Although special forms of
+cycles are permitted via *iteration* constructs, for the most part we will
+gloss over this for simplicity.
+
+
+
+Often there is a one-to-one correspondence between the transformations in the
+programs and the operators in the dataflow. Sometimes, however, one
+transformation may consist of multiple transformation operators.
+
+{% top %}
+
+## Parallel Dataflows
+
+Programs in Flink are inherently parallel and distributed. During execution, a
+*stream* has one or more **stream partitions**, and each *operator* has one or
+more **operator subtasks**. The operator subtasks are independent of one
+another, and execute in different threads and possibly on different machines or
+containers.
+
+The number of operator subtasks is the **parallelism** of that particular
+operator. The parallelism of a stream is always that of its producing operator.
+Different operators of the same program may have different levels of
+parallelism.
+
+
+
+Streams can transport data between two operators in a *one-to-one* (or
 
 Review comment:
   To me the way Fabian presents theses data exchange patterns in his book is 
easier to understand. I think it was: 
   
   * Forward
   * Broadcast
   * Random
   * Keyed
   
   IMHO the additional classification in "Redistributing" and "One-to-one" does 
not help. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379569902
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379575432
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379572141
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379560406
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
 
 Review comment:
   As far as I remember, this was supposed to be about the conceptual 
differences between state handling in batch and stream processing. I don't 
remember exactly, what we talked about here. The only thing I can think of 
right now is, that the data structures in which state is stored need to be 
different in batch and stream processing for it to be efficient.
   
   I guess one could also say something about boundedness and expiry of state?
   
   @StephanEwen what did we have in mind here?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379554124
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
 
 Review comment:
   is the sequence of events the state or does the state store the sequence of 
events? I would say the former. Same for the examples below.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379575236
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379587916
 
 

 ##
 File path: docs/dev/stream/state/state.md
 ##
 @@ -22,66 +22,17 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-This document explains how to use Flink's state abstractions when developing 
an application.
+In this section you will learn about the stateful abstractions that Flink
 
 Review comment:
   see above


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379586538
 
 

 ##
 File path: docs/concepts/timely-stream-processing.md
 ##
 @@ -0,0 +1,237 @@
+---
+title: Timely Stream Processing
+nav-id: timely-stream-processing
+nav-pos: 3
+nav-title: Timely Stream Processing
+nav-parent_id: concepts
+---
+
+
+`TODO: add introduction`
+
+* This will be replaced by the TOC
+{:toc}
+
+## Latency & Completeness
+
+`TODO: add these two sections`
+
+### Latency vs. Completeness in Batch & Stream Processing
+
+{% top %}
+
+## Event Time, Processing Time, and Ingestion Time
+
+When referring to time in a streaming program (for example to define windows),
+one can refer to different notions of *time*:
+
+- **Processing time:** Processing time refers to the system time of the machine
+  that is executing the respective operation.
+
+  When a streaming program runs on processing time, all time-based operations
+  (like time windows) will use the system clock of the machines that run the
+  respective operator. An hourly processing time window will include all
+  records that arrived at a specific operator between the times when the system
+  clock indicated the full hour. For example, if an application begins running
+  at 9:15am, the first hourly processing time window will include events
+  processed between 9:15am and 10:00am, the next window will include events
+  processed between 10:00am and 11:00am, and so on.
+
+  Processing time is the simplest notion of time and requires no coordination
+  between streams and machines.  It provides the best performance and the
+  lowest latency. However, in distributed and asynchronous environments
+  processing time does not provide determinism, because it is susceptible to
+  the speed at which records arrive in the system (for example from the message
+  queue), to the speed at which the records flow between operators inside the
+  system, and to outages (scheduled, or otherwise).
+
+- **Event time:** Event time is the time that each individual event occurred on
+  its producing device.  This time is typically embedded within the records
+  before they enter Flink, and that *event timestamp* can be extracted from
+  each record. In event time, the progress of time depends on the data, not on
+  any wall clocks. Event time programs must specify how to generate *Event Time
+  Watermarks*, which is the mechanism that signals progress in event time. This
+  watermarking mechanism is described in a later section,
+  [below](#event-time-and-watermarks).
+
+  In a perfect world, event time processing would yield completely consistent
+  and deterministic results, regardless of when events arrive, or their
+  ordering.  However, unless the events are known to arrive in-order (by
+  timestamp), event time processing incurs some latency while waiting for
+  out-of-order events. As it is only possible to wait for a finite period of
+  time, this places a limit on how deterministic event time applications can
+  be.
+
+  Assuming all of the data has arrived, event time operations will behave as
+  expected, and produce correct and consistent results even when working with
+  out-of-order or late events, or when reprocessing historic data. For example,
+  an hourly event time window will contain all records that carry an event
+  timestamp that falls into that hour, regardless of the order in which they
+  arrive, or when they are processed. (See the section on [late
+  events](#late-elements) for more information.)
+
+
+
+  Note that sometimes when event time programs are processing live data in
+  real-time, they will use some *processing time* operations in order to
+  guarantee that they are progressing in a timely fashion.
+
+- **Ingestion time:** Ingestion time is the time that events enter Flink. At
+  the source operator each record gets the source's current time as a
+  timestamp, and time-based operations (like time windows) refer to that
+  timestamp.
+
+  *Ingestion time* sits conceptually in between *event time* and *processing
+  time*. Compared to *processing time*, it is slightly more expensive, but
+  gives more predictable results. Because *ingestion time* uses stable
+  timestamps (assigned once at the source), different window operations over
+  the records will refer to the same timestamp, whereas in *processing time*
+  each window operator may assign the record to a different window (based on
+  the local system clock and any transport delay).
+
+  Compared to *event time*, *ingestion time* programs cannot handle any
+  out-of-order events or late data, but the programs don't have to specify how
+  to generate *watermarks*.
+
+  Internally, *ingestion time* is treated much like *event time*, but with
+  automatic timestamp assignment and automatic watermark generation.
+
+
+
+{% top %}
+

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379562658
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
 
 Review comment:
   I think, this introduction should already mention the concept and importance 
of state locality. Maybe with the typical figure of two-tiered architecture to 
state and logic fused into one thing.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379554299
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
 
 Review comment:
   ```suggestion
   Flink needs to be aware of the state in order to it fault tolerant
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379583881
 
 

 ##
 File path: docs/concepts/flink-architecture.md
 ##
 @@ -0,0 +1,140 @@
+---
+title: Flink Architecture
+nav-id: flink-architecture
+nav-pos: 4
+nav-title: Flink Architecture
+nav-parent_id: concepts
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Flink Applications and Flink Sessions
+
+`TODO: expand this section`
+
+{% top %}
+
+## Anatomy of a Flink Cluster
+
+`TODO: expand this section, especially about components of the Flink Master and
+container environments`
+
+The Flink runtime consists of two types of processes:
+
+  - The *Flink Master* coordinates the distributed execution. It schedules
+tasks, coordinates checkpoints, coordinates recovery on failures, etc.
+
+There is always at least one *Flink Master*. A high-availability setup will
+have multiple *Flink Masters*, one of which one is always the *leader*, and
+the others are *standby*.
+
+  - The *TaskManagers* (also called *workers*) execute the *tasks* (or more
+specifically, the subtasks) of a dataflow, and buffer and exchange the data
+*streams*.
+
+There must always be at least one TaskManager.
+
+The Flink Master and TaskManagers can be started in various ways: directly on
+the machines as a [standalone cluster]({{ site.baseurl }}{% link
+ops/deployment/cluster_setup.md %}), in containers, or managed by resource
+frameworks like [YARN]({{ site.baseurl }}{% link ops/deployment/yarn_setup.md
+%}) or [Mesos]({{ site.baseurl }}{% link ops/deployment/mesos.md %}).
+TaskManagers connect to Flink Masters, announcing themselves as available, and
+are assigned work.
+
+The *client* is not part of the runtime and program execution, but is used to
+prepare and send a dataflow to the Flink Master.  After that, the client can
+disconnect, or stay connected to receive progress reports. The client runs
+either as part of the Java/Scala program that triggers the execution, or in the
+command line process `./bin/flink run ...`.
+
+
+
+{% top %}
+
+## Tasks and Operator Chains
+
+For distributed execution, Flink *chains* operator subtasks together into
+*tasks*. Each task is executed by one thread.  Chaining operators together into
+tasks is a useful optimization: it reduces the overhead of thread-to-thread
+handover and buffering, and increases overall throughput while decreasing
+latency.  The chaining behavior can be configured; see the [chaining docs]({{
+site.baseurl }}{% link dev/stream/operators/index.md
+%}#task-chaining-and-resource-groups) for details.
+
+The sample dataflow in the figure below is executed with five subtasks, and
+hence with five parallel threads.
+
+
+
+{% top %}
+
+## Task Slots and Resources
+
+Each worker (TaskManager) is a *JVM process*, and may execute one or more
+subtasks in separate threads.  To control how many tasks a worker accepts, a
+worker has so called **task slots** (at least one).
+
+Each *task slot* represents a fixed subset of resources of the TaskManager. A
 
 Review comment:
   There is already quite some operational content mixed in here. Could be 
shorter in the Concepts section in my opinion.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379547482
 
 

 ##
 File path: docs/concepts/stream-processing.md
 ##
 @@ -0,0 +1,96 @@
+---
+title: Stream Processing
+nav-id: stream-processing
+nav-pos: 1
+nav-title: Stream Processing
+nav-parent_id: concepts
+---
+
+
+`TODO: Add introduction`
+* This will be replaced by the TOC
+{:toc}
+
+## A Unified System for Batch & Stream Processing
+
+`TODO`
+
+{% top %}
+
+## Programs and Dataflows
+
+The basic building blocks of Flink programs are **streams** and
+**transformations**. Conceptually a *stream* is a (potentially never-ending)
+flow of data records, and a *transformation* is an operation that takes one or
+more streams as input, and produces one or more output streams as a result.
+
+When executed, Flink programs are mapped to **streaming dataflows**, consisting
+of **streams** and transformation **operators**. Each dataflow starts with one
+or more **sources** and ends in one or more **sinks**. The dataflows resemble
+arbitrary **directed acyclic graphs** *(DAGs)*. Although special forms of
+cycles are permitted via *iteration* constructs, for the most part we will
+gloss over this for simplicity.
+
+
+
+Often there is a one-to-one correspondence between the transformations in the
+programs and the operators in the dataflow. Sometimes, however, one
+transformation may consist of multiple transformation operators.
+
+{% top %}
+
+## Parallel Dataflows
+
+Programs in Flink are inherently parallel and distributed. During execution, a
+*stream* has one or more **stream partitions**, and each *operator* has one or
+more **operator subtasks**. The operator subtasks are independent of one
+another, and execute in different threads and possibly on different machines or
+containers.
+
+The number of operator subtasks is the **parallelism** of that particular
+operator. The parallelism of a stream is always that of its producing operator.
+Different operators of the same program may have different levels of
+parallelism.
+
+
 
 Review comment:
   I would update this figure. 
   
   Top: Logical (Data Flow) Graph 
   Bottom: Physical Graph
   
   In the bottom graph, all subtasks for an operator together are *not* an 
operator. It looks like it in the figure.
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379584764
 
 

 ##
 File path: docs/concepts/flink-architecture.md
 ##
 @@ -0,0 +1,140 @@
+---
+title: Flink Architecture
+nav-id: flink-architecture
+nav-pos: 4
+nav-title: Flink Architecture
+nav-parent_id: concepts
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Flink Applications and Flink Sessions
+
+`TODO: expand this section`
+
+{% top %}
+
+## Anatomy of a Flink Cluster
+
+`TODO: expand this section, especially about components of the Flink Master and
+container environments`
+
+The Flink runtime consists of two types of processes:
+
+  - The *Flink Master* coordinates the distributed execution. It schedules
+tasks, coordinates checkpoints, coordinates recovery on failures, etc.
+
+There is always at least one *Flink Master*. A high-availability setup will
+have multiple *Flink Masters*, one of which one is always the *leader*, and
+the others are *standby*.
+
+  - The *TaskManagers* (also called *workers*) execute the *tasks* (or more
+specifically, the subtasks) of a dataflow, and buffer and exchange the data
+*streams*.
+
+There must always be at least one TaskManager.
+
+The Flink Master and TaskManagers can be started in various ways: directly on
+the machines as a [standalone cluster]({{ site.baseurl }}{% link
+ops/deployment/cluster_setup.md %}), in containers, or managed by resource
+frameworks like [YARN]({{ site.baseurl }}{% link ops/deployment/yarn_setup.md
+%}) or [Mesos]({{ site.baseurl }}{% link ops/deployment/mesos.md %}).
+TaskManagers connect to Flink Masters, announcing themselves as available, and
+are assigned work.
+
+The *client* is not part of the runtime and program execution, but is used to
+prepare and send a dataflow to the Flink Master.  After that, the client can
+disconnect, or stay connected to receive progress reports. The client runs
+either as part of the Java/Scala program that triggers the execution, or in the
+command line process `./bin/flink run ...`.
+
+
+
+{% top %}
+
+## Tasks and Operator Chains
+
+For distributed execution, Flink *chains* operator subtasks together into
+*tasks*. Each task is executed by one thread.  Chaining operators together into
+tasks is a useful optimization: it reduces the overhead of thread-to-thread
+handover and buffering, and increases overall throughput while decreasing
+latency.  The chaining behavior can be configured; see the [chaining docs]({{
+site.baseurl }}{% link dev/stream/operators/index.md
+%}#task-chaining-and-resource-groups) for details.
+
+The sample dataflow in the figure below is executed with five subtasks, and
+hence with five parallel threads.
+
+
+
+{% top %}
+
+## Task Slots and Resources
+
+Each worker (TaskManager) is a *JVM process*, and may execute one or more
+subtasks in separate threads.  To control how many tasks a worker accepts, a
+worker has so called **task slots** (at least one).
+
+Each *task slot* represents a fixed subset of resources of the TaskManager. A
+TaskManager with three slots, for example, will dedicate 1/3 of its managed
+memory to each slot. Slotting the resources means that a subtask will not
+compete with subtasks from other jobs for managed memory, but instead has a
+certain amount of reserved managed memory. Note that no CPU isolation happens
+here; currently slots only separate the managed memory of tasks.
+
+By adjusting the number of task slots, users can define how subtasks are
+isolated from each other.  Having one slot per TaskManager means each task
+group runs in a separate JVM (which can be started in a separate container, for
+example). Having multiple slots means more subtasks share the same JVM. Tasks
+in the same JVM share TCP connections (via multiplexing) and heartbeat
+messages. They may also share data sets and data structures, thus reducing the
+per-task overhead.
+
+
+
+By default, Flink allows subtasks to share slots even if they are subtasks of
+different tasks, so long as they are from the same job. The result is that one
+slot may hold an entire pipeline of the job. Allowing this *slot sharing* has
+two main benefits:
+
+  - A Flink cluster needs exactly as many task slots as the highest parallelism
+used in the job.  No need to calculate how many tasks (with varying
+parallelism) a program contains in total.
+
+  - It is easier to get better resource utilization. Without slot sharing, the
+non-intensive *source/map()* subtasks would block as many resources as the
+resource intensive *window* subtasks.  With slot sharing, increasing the
+base parallelism in our example from two to six yields full utilization of
+the slotted resources, while making sure that the heavy subtasks are fairly
+distributed among the TaskManagers.
+
+
+
+The APIs also include a

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379565697
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
 
 Review comment:
   I would not consider KeyGroups a conceptual topics, but rather an 
operational or internal. I might miss something though.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379583328
 
 

 ##
 File path: docs/concepts/flink-architecture.md
 ##
 @@ -0,0 +1,140 @@
+---
+title: Flink Architecture
+nav-id: flink-architecture
+nav-pos: 4
+nav-title: Flink Architecture
+nav-parent_id: concepts
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Flink Applications and Flink Sessions
+
+`TODO: expand this section`
+
+{% top %}
+
+## Anatomy of a Flink Cluster
+
+`TODO: expand this section, especially about components of the Flink Master and
+container environments`
+
+The Flink runtime consists of two types of processes:
+
+  - The *Flink Master* coordinates the distributed execution. It schedules
+tasks, coordinates checkpoints, coordinates recovery on failures, etc.
+
+There is always at least one *Flink Master*. A high-availability setup will
+have multiple *Flink Masters*, one of which one is always the *leader*, and
+the others are *standby*.
+
+  - The *TaskManagers* (also called *workers*) execute the *tasks* (or more
+specifically, the subtasks) of a dataflow, and buffer and exchange the data
+*streams*.
+
+There must always be at least one TaskManager.
+
+The Flink Master and TaskManagers can be started in various ways: directly on
+the machines as a [standalone cluster]({{ site.baseurl }}{% link
+ops/deployment/cluster_setup.md %}), in containers, or managed by resource
+frameworks like [YARN]({{ site.baseurl }}{% link ops/deployment/yarn_setup.md
+%}) or [Mesos]({{ site.baseurl }}{% link ops/deployment/mesos.md %}).
+TaskManagers connect to Flink Masters, announcing themselves as available, and
+are assigned work.
+
+The *client* is not part of the runtime and program execution, but is used to
+prepare and send a dataflow to the Flink Master.  After that, the client can
+disconnect, or stay connected to receive progress reports. The client runs
+either as part of the Java/Scala program that triggers the execution, or in the
+command line process `./bin/flink run ...`.
+
+
+
+{% top %}
+
+## Tasks and Operator Chains
+
+For distributed execution, Flink *chains* operator subtasks together into
+*tasks*. Each task is executed by one thread.  Chaining operators together into
+tasks is a useful optimization: it reduces the overhead of thread-to-thread
+handover and buffering, and increases overall throughput while decreasing
+latency.  The chaining behavior can be configured; see the [chaining docs]({{
+site.baseurl }}{% link dev/stream/operators/index.md
+%}#task-chaining-and-resource-groups) for details.
+
+The sample dataflow in the figure below is executed with five subtasks, and
+hence with five parallel threads.
+
+
 
 Review comment:
   Top: Logical Graph, there are no tasks in the logical graph
   Bottom: Physical Graph


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379567637
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379573424
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379576013
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379569436
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379571150
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379587809
 
 

 ##
 File path: docs/dev/stream/state/index.md
 ##
 @@ -25,23 +25,10 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-Stateful functions and operators store data across the processing of 
individual elements/events, making state a critical building block for
-any type of more elaborate operation.
-
-For example:
-
-  - When an application searches for certain event patterns, the state will 
store the sequence of events encountered so far.
-  - When aggregating events per minute/hour/day, the state holds the pending 
aggregates.
-  - When training a machine learning model over a stream of data points, the 
state holds the current version of the model parameters.
-  - When historic data needs to be managed, the state allows efficient access 
to events that occurred in the past.
-
-Flink needs to be aware of the state in order to make state fault tolerant 
using [checkpoints](checkpointing.html) and to allow [savepoints]({{ 
site.baseurl }}/ops/state/savepoints.html) of streaming applications.
-
-Knowledge about the state also allows for rescaling Flink applications, 
meaning that Flink takes care of redistributing state across parallel instances.
-
-The [queryable state](queryable_state.html) feature of Flink allows you to 
access state from outside of Flink during runtime.
-
-When working with state, it might also be useful to read about [Flink's state 
backends]({{ site.baseurl }}/ops/state/state_backends.html). Flink provides 
different state backends that specify how and where state is stored. State can 
be located on Java's heap or off-heap. Depending on your state backend, Flink 
can also *manage* the state for the application, meaning Flink deals with the 
memory management (possibly spilling to disk if necessary) to allow 
applications to hold very large state. State backends can be configured without 
changing your application logic.
+In this section you will learn about the stateful abstractions that Flink
 
 Review comment:
   are abstractions stateful?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379566326
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
 
 Review comment:
   ```suggestion
   operator instances when the parallelism is changed. There are different
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379546260
 
 

 ##
 File path: docs/concepts/stream-processing.md
 ##
 @@ -0,0 +1,96 @@
+---
+title: Stream Processing
+nav-id: stream-processing
+nav-pos: 1
+nav-title: Stream Processing
+nav-parent_id: concepts
+---
+
+
+`TODO: Add introduction`
+* This will be replaced by the TOC
+{:toc}
+
+## A Unified System for Batch & Stream Processing
+
+`TODO`
+
+{% top %}
+
+## Programs and Dataflows
+
+The basic building blocks of Flink programs are **streams** and
+**transformations**. Conceptually a *stream* is a (potentially never-ending)
+flow of data records, and a *transformation* is an operation that takes one or
+more streams as input, and produces one or more output streams as a result.
+
+When executed, Flink programs are mapped to **streaming dataflows**, consisting
+of **streams** and transformation **operators**. Each dataflow starts with one
+or more **sources** and ends in one or more **sinks**. The dataflows resemble
+arbitrary **directed acyclic graphs** *(DAGs)*. Although special forms of
+cycles are permitted via *iteration* constructs, for the most part we will
+gloss over this for simplicity.
+
+
+
+Often there is a one-to-one correspondence between the transformations in the
+programs and the operators in the dataflow. Sometimes, however, one
+transformation may consist of multiple transformation operators.
 
 Review comment:
   ```suggestion
   transformation may consist of multiple operators.
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379568312
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379582165
 
 

 ##
 File path: docs/concepts/flink-architecture.md
 ##
 @@ -0,0 +1,140 @@
+---
+title: Flink Architecture
+nav-id: flink-architecture
+nav-pos: 4
+nav-title: Flink Architecture
+nav-parent_id: concepts
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Flink Applications and Flink Sessions
+
+`TODO: expand this section`
+
+{% top %}
+
+## Anatomy of a Flink Cluster
+
+`TODO: expand this section, especially about components of the Flink Master and
+container environments`
+
+The Flink runtime consists of two types of processes:
+
+  - The *Flink Master* coordinates the distributed execution. It schedules
+tasks, coordinates checkpoints, coordinates recovery on failures, etc.
+
+There is always at least one *Flink Master*. A high-availability setup will
 
 Review comment:
   ```suggestion
   There is always at least one *Flink Master*. A high-availability setup 
might
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379556110
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
 
 Review comment:
   Queryable State allows your to...


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379570309
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379576154
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379581742
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379546166
 
 

 ##
 File path: docs/concepts/stream-processing.md
 ##
 @@ -0,0 +1,96 @@
+---
+title: Stream Processing
+nav-id: stream-processing
+nav-pos: 1
+nav-title: Stream Processing
+nav-parent_id: concepts
+---
+
+
+`TODO: Add introduction`
+* This will be replaced by the TOC
+{:toc}
+
+## A Unified System for Batch & Stream Processing
+
+`TODO`
+
+{% top %}
+
+## Programs and Dataflows
+
+The basic building blocks of Flink programs are **streams** and
+**transformations**. Conceptually a *stream* is a (potentially never-ending)
+flow of data records, and a *transformation* is an operation that takes one or
+more streams as input, and produces one or more output streams as a result.
+
+When executed, Flink programs are mapped to **streaming dataflows**, consisting
+of **streams** and transformation **operators**. Each dataflow starts with one
+or more **sources** and ends in one or more **sinks**. The dataflows resemble
+arbitrary **directed acyclic graphs** *(DAGs)*. Although special forms of
+cycles are permitted via *iteration* constructs, for the most part we will
+gloss over this for simplicity.
+
+
+
+Often there is a one-to-one correspondence between the transformations in the
+programs and the operators in the dataflow. Sometimes, however, one
 
 Review comment:
   ```suggestion
   programs and the operators in the logical dataflow graph. Sometimes, 
however, one
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379552116
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
 
 Review comment:
   Operations vs Operator vs Function vs Transformation?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379571692
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379550966
 
 

 ##
 File path: docs/concepts/stream-processing.md
 ##
 @@ -0,0 +1,96 @@
+---
+title: Stream Processing
+nav-id: stream-processing
+nav-pos: 1
+nav-title: Stream Processing
+nav-parent_id: concepts
+---
+
+
+`TODO: Add introduction`
+* This will be replaced by the TOC
+{:toc}
+
+## A Unified System for Batch & Stream Processing
+
+`TODO`
+
+{% top %}
+
+## Programs and Dataflows
+
+The basic building blocks of Flink programs are **streams** and
+**transformations**. Conceptually a *stream* is a (potentially never-ending)
+flow of data records, and a *transformation* is an operation that takes one or
+more streams as input, and produces one or more output streams as a result.
+
+When executed, Flink programs are mapped to **streaming dataflows**, consisting
+of **streams** and transformation **operators**. Each dataflow starts with one
+or more **sources** and ends in one or more **sinks**. The dataflows resemble
+arbitrary **directed acyclic graphs** *(DAGs)*. Although special forms of
+cycles are permitted via *iteration* constructs, for the most part we will
+gloss over this for simplicity.
+
+
+
+Often there is a one-to-one correspondence between the transformations in the
+programs and the operators in the dataflow. Sometimes, however, one
+transformation may consist of multiple transformation operators.
+
+{% top %}
+
+## Parallel Dataflows
+
+Programs in Flink are inherently parallel and distributed. During execution, a
+*stream* has one or more **stream partitions**, and each *operator* has one or
+more **operator subtasks**. The operator subtasks are independent of one
+another, and execute in different threads and possibly on different machines or
+containers.
+
+The number of operator subtasks is the **parallelism** of that particular
+operator. The parallelism of a stream is always that of its producing operator.
+Different operators of the same program may have different levels of
+parallelism.
+
+
+
+Streams can transport data between two operators in a *one-to-one* (or
 
 Review comment:
   I think,  different redistribution patterns that Fabian it in his book is 
more to the point. I think it was: 
   
   * Forward
   * Broadcast
   * Random
   * Keyed
   
   IMHO the additional classification in "Redistributing" and "One-to-one" does 
not help. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379545948
 
 

 ##
 File path: docs/concepts/stream-processing.md
 ##
 @@ -0,0 +1,96 @@
+---
+title: Stream Processing
+nav-id: stream-processing
+nav-pos: 1
+nav-title: Stream Processing
+nav-parent_id: concepts
+---
+
+
+`TODO: Add introduction`
+* This will be replaced by the TOC
+{:toc}
+
+## A Unified System for Batch & Stream Processing
+
+`TODO`
+
+{% top %}
+
+## Programs and Dataflows
+
+The basic building blocks of Flink programs are **streams** and
+**transformations**. Conceptually a *stream* is a (potentially never-ending)
+flow of data records, and a *transformation* is an operation that takes one or
+more streams as input, and produces one or more output streams as a result.
+
+When executed, Flink programs are mapped to **streaming dataflows**, consisting
+of **streams** and transformation **operators**. Each dataflow starts with one
+or more **sources** and ends in one or more **sinks**. The dataflows resemble
+arbitrary **directed acyclic graphs** *(DAGs)*. Although special forms of
+cycles are permitted via *iteration* constructs, for the most part we will
+gloss over this for simplicity.
+
+
 
 Review comment:
   In the glossary we call the "streaming dataflow" "logical graph" or 
"jobgraph". Might want to add this here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379566803
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
 
 Review comment:
   ```suggestion
   support use cases where records

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379576359
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379566028
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
 
 Review comment:
   parallel operator instance => sub-task?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379574304
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379582452
 
 

 ##
 File path: docs/concepts/flink-architecture.md
 ##
 @@ -0,0 +1,140 @@
+---
+title: Flink Architecture
+nav-id: flink-architecture
+nav-pos: 4
+nav-title: Flink Architecture
+nav-parent_id: concepts
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Flink Applications and Flink Sessions
+
+`TODO: expand this section`
+
+{% top %}
+
+## Anatomy of a Flink Cluster
+
+`TODO: expand this section, especially about components of the Flink Master and
+container environments`
+
+The Flink runtime consists of two types of processes:
+
+  - The *Flink Master* coordinates the distributed execution. It schedules
+tasks, coordinates checkpoints, coordinates recovery on failures, etc.
+
+There is always at least one *Flink Master*. A high-availability setup will
+have multiple *Flink Masters*, one of which one is always the *leader*, and
+the others are *standby*.
+
+  - The *TaskManagers* (also called *workers*) execute the *tasks* (or more
+specifically, the subtasks) of a dataflow, and buffer and exchange the data
 
 Review comment:
   according to the glossary "(or more specifically, the subtasks)" does not 
make sense.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379551776
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
 
 Review comment:
   ```suggestion
   across multiple events (for example window operators). These operations are
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379587424
 
 

 ##
 File path: docs/concepts/timely-stream-processing.md
 ##
 @@ -0,0 +1,237 @@
+---
+title: Timely Stream Processing
+nav-id: timely-stream-processing
+nav-pos: 3
+nav-title: Timely Stream Processing
+nav-parent_id: concepts
+---
+
+
+`TODO: add introduction`
+
+* This will be replaced by the TOC
+{:toc}
+
+## Latency & Completeness
+
+`TODO: add these two sections`
+
+### Latency vs. Completeness in Batch & Stream Processing
+
+{% top %}
+
+## Event Time, Processing Time, and Ingestion Time
+
+When referring to time in a streaming program (for example to define windows),
+one can refer to different notions of *time*:
+
+- **Processing time:** Processing time refers to the system time of the machine
+  that is executing the respective operation.
+
+  When a streaming program runs on processing time, all time-based operations
+  (like time windows) will use the system clock of the machines that run the
+  respective operator. An hourly processing time window will include all
+  records that arrived at a specific operator between the times when the system
+  clock indicated the full hour. For example, if an application begins running
+  at 9:15am, the first hourly processing time window will include events
+  processed between 9:15am and 10:00am, the next window will include events
+  processed between 10:00am and 11:00am, and so on.
+
+  Processing time is the simplest notion of time and requires no coordination
+  between streams and machines.  It provides the best performance and the
+  lowest latency. However, in distributed and asynchronous environments
+  processing time does not provide determinism, because it is susceptible to
+  the speed at which records arrive in the system (for example from the message
+  queue), to the speed at which the records flow between operators inside the
+  system, and to outages (scheduled, or otherwise).
+
+- **Event time:** Event time is the time that each individual event occurred on
+  its producing device.  This time is typically embedded within the records
+  before they enter Flink, and that *event timestamp* can be extracted from
+  each record. In event time, the progress of time depends on the data, not on
+  any wall clocks. Event time programs must specify how to generate *Event Time
+  Watermarks*, which is the mechanism that signals progress in event time. This
+  watermarking mechanism is described in a later section,
+  [below](#event-time-and-watermarks).
+
+  In a perfect world, event time processing would yield completely consistent
+  and deterministic results, regardless of when events arrive, or their
+  ordering.  However, unless the events are known to arrive in-order (by
+  timestamp), event time processing incurs some latency while waiting for
+  out-of-order events. As it is only possible to wait for a finite period of
+  time, this places a limit on how deterministic event time applications can
+  be.
+
+  Assuming all of the data has arrived, event time operations will behave as
+  expected, and produce correct and consistent results even when working with
+  out-of-order or late events, or when reprocessing historic data. For example,
+  an hourly event time window will contain all records that carry an event
+  timestamp that falls into that hour, regardless of the order in which they
+  arrive, or when they are processed. (See the section on [late
+  events](#late-elements) for more information.)
+
+
+
+  Note that sometimes when event time programs are processing live data in
+  real-time, they will use some *processing time* operations in order to
+  guarantee that they are progressing in a timely fashion.
+
+- **Ingestion time:** Ingestion time is the time that events enter Flink. At
+  the source operator each record gets the source's current time as a
+  timestamp, and time-based operations (like time windows) refer to that
+  timestamp.
+
+  *Ingestion time* sits conceptually in between *event time* and *processing
+  time*. Compared to *processing time*, it is slightly more expensive, but
+  gives more predictable results. Because *ingestion time* uses stable
+  timestamps (assigned once at the source), different window operations over
+  the records will refer to the same timestamp, whereas in *processing time*
+  each window operator may assign the record to a different window (based on
+  the local system clock and any transport delay).
+
+  Compared to *event time*, *ingestion time* programs cannot handle any
+  out-of-order events or late data, but the programs don't have to specify how
+  to generate *watermarks*.
+
+  Internally, *ingestion time* is treated much like *event time*, but with
+  automatic timestamp assignment and automatic watermark generation.
+
+
+
+{% top %}
+

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379571869
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379584150
 
 

 ##
 File path: docs/concepts/flink-architecture.md
 ##
 @@ -0,0 +1,140 @@
+---
+title: Flink Architecture
+nav-id: flink-architecture
+nav-pos: 4
+nav-title: Flink Architecture
+nav-parent_id: concepts
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Flink Applications and Flink Sessions
+
+`TODO: expand this section`
+
+{% top %}
+
+## Anatomy of a Flink Cluster
+
+`TODO: expand this section, especially about components of the Flink Master and
+container environments`
+
+The Flink runtime consists of two types of processes:
+
+  - The *Flink Master* coordinates the distributed execution. It schedules
+tasks, coordinates checkpoints, coordinates recovery on failures, etc.
+
+There is always at least one *Flink Master*. A high-availability setup will
+have multiple *Flink Masters*, one of which one is always the *leader*, and
+the others are *standby*.
+
+  - The *TaskManagers* (also called *workers*) execute the *tasks* (or more
+specifically, the subtasks) of a dataflow, and buffer and exchange the data
+*streams*.
+
+There must always be at least one TaskManager.
+
+The Flink Master and TaskManagers can be started in various ways: directly on
+the machines as a [standalone cluster]({{ site.baseurl }}{% link
+ops/deployment/cluster_setup.md %}), in containers, or managed by resource
+frameworks like [YARN]({{ site.baseurl }}{% link ops/deployment/yarn_setup.md
+%}) or [Mesos]({{ site.baseurl }}{% link ops/deployment/mesos.md %}).
+TaskManagers connect to Flink Masters, announcing themselves as available, and
+are assigned work.
+
+The *client* is not part of the runtime and program execution, but is used to
+prepare and send a dataflow to the Flink Master.  After that, the client can
+disconnect, or stay connected to receive progress reports. The client runs
+either as part of the Java/Scala program that triggers the execution, or in the
+command line process `./bin/flink run ...`.
+
+
+
+{% top %}
+
+## Tasks and Operator Chains
+
+For distributed execution, Flink *chains* operator subtasks together into
+*tasks*. Each task is executed by one thread.  Chaining operators together into
+tasks is a useful optimization: it reduces the overhead of thread-to-thread
+handover and buffering, and increases overall throughput while decreasing
+latency.  The chaining behavior can be configured; see the [chaining docs]({{
+site.baseurl }}{% link dev/stream/operators/index.md
+%}#task-chaining-and-resource-groups) for details.
+
+The sample dataflow in the figure below is executed with five subtasks, and
+hence with five parallel threads.
+
+
+
+{% top %}
+
+## Task Slots and Resources
+
+Each worker (TaskManager) is a *JVM process*, and may execute one or more
+subtasks in separate threads.  To control how many tasks a worker accepts, a
+worker has so called **task slots** (at least one).
+
+Each *task slot* represents a fixed subset of resources of the TaskManager. A
+TaskManager with three slots, for example, will dedicate 1/3 of its managed
+memory to each slot. Slotting the resources means that a subtask will not
+compete with subtasks from other jobs for managed memory, but instead has a
+certain amount of reserved managed memory. Note that no CPU isolation happens
+here; currently slots only separate the managed memory of tasks.
+
+By adjusting the number of task slots, users can define how subtasks are
+isolated from each other.  Having one slot per TaskManager means each task
+group runs in a separate JVM (which can be started in a separate container, for
+example). Having multiple slots means more subtasks share the same JVM. Tasks
+in the same JVM share TCP connections (via multiplexing) and heartbeat
+messages. They may also share data sets and data structures, thus reducing the
+per-task overhead.
+
+
+
+By default, Flink allows subtasks to share slots even if they are subtasks of
+different tasks, so long as they are from the same job. The result is that one
 
 Review comment:
   ```suggestion
   different operators, so long as they are from the same job. The result is 
that one
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379558964
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
 
 Review comment:
   I would remove this section completely. It tries to cover too much to early. 
There is a section about Statebackends where this can be explained in more 
detail.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379569140
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379553632
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
 
 Review comment:
   This paragraph feels repetitive. Maybe mention that while the output of a 
stateless function only depends on the input, the output of a stateful function 
depends on the input as well as its current state. As the state is a function 
full history of events, the output of a stateful function can depend on the the 
input as well as all previous inputs.
   
   Maybe this is too theoretical or mathematica. It is a slightly different 
view on it though.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379561027
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
 
 Review comment:
   ```suggestion
   state is only possible on *keyed streams*, i.e. after a keyed partitioned 
data exchange, and is
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379574851
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379570828
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379567553
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
+%}) of streaming applications.
+
+Knowledge about the state also allows for rescaling Flink applications, meaning
+that Flink takes care of redistributing state across parallel instances.
+
+The [queryable state]({{ site.baseurl }}{% link
+dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
+state from outside of Flink during runtime.
+
+When working with state, it might also be useful to read about [Flink's state
+backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
+provides different state backends that specify how and where state is stored.
+State can be located on Java's heap or off-heap. Depending on your state
+backend, Flink can also *manage* the state for the application, meaning Flink
+deals with the memory management (possibly spilling to disk if necessary) to
+allow applications to hold very large state. State backends can be configured
+without changing your application logic.
+
+* This will be replaced by the TOC
+{:toc}
+
+## What is State?
+
+`TODO: expand this section`
+
+There are different types of state in Flink, the most-used type of state is
+*Keyed State*. For special cases you can use *Operator State* and *Broadcast
+State*. *Broadcast State* is a special type of *Operator State*.
+
+{% top %}
+
+## State in Stream & Batch Processing
+
+`TODO: What is this section about? Do we even need it?`
+
+{% top %}
+
+## Keyed State
+
+Keyed state is maintained in what can be thought of as an embedded key/value
+store.  The state is partitioned and distributed strictly together with the
+streams that are read by the stateful operators. Hence, access to the key/value
+state is only possible on *keyed streams*, after a *keyBy()* function, and is
+restricted to the values associated with the current event's key. Aligning the
+keys of streams and state makes sure that all state updates are local
+operations, guaranteeing consistency without transaction overhead.  This
+alignment also allows Flink to redistribute the state and adjust the stream
+partitioning transparently.
+
+
+
+Keyed State is further organized into so-called *Key Groups*. Key Groups are
+the atomic unit by which Flink can redistribute Keyed State; there are exactly
+as many Key Groups as the defined maximum parallelism.  During execution each
+parallel instance of a keyed operator works with the keys for one or more Key
+Groups.
+
+`TODO: potentially leave out Operator State and Broadcast State from concepts 
documentation`
+
+## Operator State
+
+*Operator State* (or *non-keyed state*) is state that is is bound to one
+parallel operator instance.  The [Kafka Connector]({{ site.baseurl }}{% link
+dev/connectors/kafka.md %}) is a good motivating example for the use of
+Operator State in Flink. Each parallel instance of the Kafka consumer maintains
+a map of topic partitions and offsets as its Operator State.
+
+The Operator State interfaces support redistributing state among parallel
+operator instances when the parallelism is changed. There can be different
+schemes for doing this redistribution.
+
+## Broadcast State
+
+*Broadcast State* is a special type of *Operator State*.  It was introduced to
+support use cases where some data coming from one stream is required to be
+broadcasted to all downstream tasks, where it is stored locally and

[GitHub] [flink] knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section

2020-02-14 Thread GitBox

knaufk commented on a change in pull request #11092: [FLINK-15999] Extract 
“Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379555751
 
 

 ##
 File path: docs/concepts/stateful-stream-processing.md
 ##
 @@ -0,0 +1,412 @@
+---
+title: Stateful Stream Processing
+nav-id: stateful-stream-processing
+nav-pos: 2
+nav-title: Stateful Stream Processing
+nav-parent_id: concepts
+---
+
+
+While many operations in a dataflow simply look at one individual *event at a
+time* (for example an event parser), some operations remember information
+across multiple events (for example window operators).  These operations are
+called **stateful**.
+
+Stateful functions and operators store data across the processing of individual
+elements/events, making state a critical building block for any type of more
+elaborate operation.
+
+For example:
+
+  - When an application searches for certain event patterns, the state will
+store the sequence of events encountered so far.
+  - When aggregating events per minute/hour/day, the state holds the pending
+aggregates.
+  - When training a machine learning model over a stream of data points, the
+state holds the current version of the model parameters.
+  - When historic data needs to be managed, the state allows efficient access
+to events that occurred in the past.
+
+Flink needs to be aware of the state in order to make state fault tolerant
+using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
+%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
 
 Review comment:
   "Flink needs to be aware of the state in order to allow savepoints." does 
not make sense to me."


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

52 matches

Mail list logo