[jira] [Commented] (FLINK-2445) Add tests for HadoopOutputFormats
[ https://issues.apache.org/jira/browse/FLINK-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210737#comment-15210737 ]

ASF GitHub Bot commented on FLINK-2445:
---------------------------------------

Github user mliesenberg closed the pull request at:

    https://github.com/apache/flink/pull/1625

> Add tests for HadoopOutputFormats
> ---------------------------------
>
> Key: FLINK-2445
> URL: https://issues.apache.org/jira/browse/FLINK-2445
> Project: Flink
> Issue Type: Test
> Components: Hadoop Compatibility, Tests
> Affects Versions: 0.9.1, 0.10.0
> Reporter: Fabian Hueske
> Assignee: Martin Liesenberg
> Labels: starter
> Fix For: 1.1.0, 1.0.1
>
> The HadoopOutputFormats and HadoopOutputFormatBase classes are not
> sufficiently covered by unit tests.
> We need tests that ensure that the methods of the wrapped Hadoop
> OutputFormats are correctly called.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
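The kind of coverage the issue asks for — verifying that a wrapper forwards each lifecycle call to the wrapped Hadoop OutputFormat — can be sketched without Hadoop at all. The `SinkFormat`, `RecordingFormat`, and `Wrapper` names below are hypothetical stand-ins for the real Hadoop interface, a mock, and HadoopOutputFormatBase; a recording stub plays the role a Mockito mock with `verify()` would play in the actual tests:

```java
import java.util.ArrayList;
import java.util.List;

public class ForwardingTest {
    // Hypothetical stand-in for the wrapped Hadoop OutputFormat interface.
    interface SinkFormat {
        void open(int taskNumber);
        void writeRecord(String record);
        void close();
    }

    // Recording stub: remembers every call so the test can assert on order
    // and arguments, much like a Mockito mock with verify().
    static class RecordingFormat implements SinkFormat {
        final List<String> calls = new ArrayList<>();
        public void open(int taskNumber) { calls.add("open:" + taskNumber); }
        public void writeRecord(String record) { calls.add("write:" + record); }
        public void close() { calls.add("close"); }
    }

    // Stand-in for the wrapper under test (cf. HadoopOutputFormatBase):
    // it must delegate each lifecycle method to the wrapped format.
    static class Wrapper {
        private final SinkFormat wrapped;
        Wrapper(SinkFormat wrapped) { this.wrapped = wrapped; }
        void open(int taskNumber) { wrapped.open(taskNumber); }
        void writeRecord(String record) { wrapped.writeRecord(record); }
        void close() { wrapped.close(); }
    }

    public static void main(String[] args) {
        RecordingFormat inner = new RecordingFormat();
        Wrapper wrapper = new Wrapper(inner);
        wrapper.open(0);
        wrapper.writeRecord("a");
        wrapper.close();
        // The wrapper must have forwarded all three calls, in order.
        if (!inner.calls.equals(List.of("open:0", "write:a", "close"))) {
            throw new AssertionError(inner.calls);
        }
        System.out.println("forwarding verified: " + inner.calls);
    }
}
```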
[GitHub] flink pull request: [FLINK-2445] Add tests for HadoopOutputFormats
Github user mliesenberg closed the pull request at:

    https://github.com/apache/flink/pull/1625

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
[jira] [Commented] (FLINK-2445) Add tests for HadoopOutputFormats
[ https://issues.apache.org/jira/browse/FLINK-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210724#comment-15210724 ]

ASF GitHub Bot commented on FLINK-2445:
---------------------------------------

Github user fhueske commented on the pull request:

    https://github.com/apache/flink/pull/1625#issuecomment-200961463

    Hi @mliesenberg, I merged this PR but it wasn't correctly closed. Can you do it? Thanks
[GitHub] flink pull request: [FLINK-2445] Add tests for HadoopOutputFormats
Github user fhueske commented on the pull request:

    https://github.com/apache/flink/pull/1625#issuecomment-200961463

    Hi @mliesenberg, I merged this PR but it wasn't correctly closed. Can you do it? Thanks
[GitHub] flink pull request: [FLINK-3659] Allow ConnectedStreams to Be Keye...
Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1831#issuecomment-200951040

    Thanks for describing this. This has quite some big implications, as far as I can
    see it. The state in the connected stream is now a "broadcast state", not
    partitioned, so allowing that on the key/value state probably breaks some ongoing
    efforts, like scaling, etc.

    How about a cleaner separation of these things:
    1. The connected streams (fully partitioned or not)
    2. Broadcast inputs, which are similar to the "side inputs" in Cloud Dataflow or
       the broadcast variables in the DataSet API.

    That gives us:
    - clean semantics, behavior that users can work with
    - we do not overcomplicate the key/value abstraction
    - we can also make sure we checkpoint the broadcast state only once (rather than
      on each parallel subtask)
[jira] [Commented] (FLINK-3659) Allow ConnectedStreams to Be Keyed on Only One Side
[ https://issues.apache.org/jira/browse/FLINK-3659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210655#comment-15210655 ]

ASF GitHub Bot commented on FLINK-3659:
---------------------------------------

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1831#issuecomment-200951040

    Thanks for describing this. This has quite some big implications, as far as I can
    see it. The state in the connected stream is now a "broadcast state", not
    partitioned, so allowing that on the key/value state probably breaks some ongoing
    efforts, like scaling, etc.

    How about a cleaner separation of these things:
    1. The connected streams (fully partitioned or not)
    2. Broadcast inputs, which are similar to the "side inputs" in Cloud Dataflow or
       the broadcast variables in the DataSet API.

    That gives us:
    - clean semantics, behavior that users can work with
    - we do not overcomplicate the key/value abstraction
    - we can also make sure we checkpoint the broadcast state only once (rather than
      on each parallel subtask)

> Allow ConnectedStreams to Be Keyed on Only One Side
> ---------------------------------------------------
>
> Key: FLINK-3659
> URL: https://issues.apache.org/jira/browse/FLINK-3659
> Project: Flink
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 1.0.0
> Reporter: Aljoscha Krettek
> Assignee: Aljoscha Krettek
>
> Currently, we only allow partitioned state when both inputs are keyed (with
> the same key type). I think a valid use case is to have only one side keyed
> and have the other side broadcast to publish some updates that are relevant
> for all keys.
> When relaxing the requirement to have only one side keyed we must still
> ensure that the same key is used if both sides are keyed.
> [~gyfora] Did you try this with what you're working on?
[jira] [Commented] (FLINK-3659) Allow ConnectedStreams to Be Keyed on Only One Side
[ https://issues.apache.org/jira/browse/FLINK-3659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210628#comment-15210628 ]

ASF GitHub Bot commented on FLINK-3659:
---------------------------------------

Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/1831#issuecomment-200945681

    I like @aljoscha's idea to separate more explicitly the user state access and its
    implementation. Having an accessor would also allow us to get rid of the swapping
    of the actual state objects which are wrapped by the `State` objects. Then the
    implementation of a `StateBackend` wouldn't have to be spread out over the
    `KvState` classes anymore. This again would make it easier to integrate the notion
    of virtual state partitions/shards into `StateBackend`s. So in general, I think it
    would simplify our current `StateBackend` implementations noticeably.
[GitHub] flink pull request: [FLINK-3659] Allow ConnectedStreams to Be Keye...
Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/1831#issuecomment-200945681

    I like @aljoscha's idea to separate more explicitly the user state access and its
    implementation. Having an accessor would also allow us to get rid of the swapping
    of the actual state objects which are wrapped by the `State` objects. Then the
    implementation of a `StateBackend` wouldn't have to be spread out over the
    `KvState` classes anymore. This again would make it easier to integrate the notion
    of virtual state partitions/shards into `StateBackend`s. So in general, I think it
    would simplify our current `StateBackend` implementations noticeably.
[jira] [Commented] (FLINK-3547) Add support for streaming projection, selection, and union
[ https://issues.apache.org/jira/browse/FLINK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210612#comment-15210612 ]

ASF GitHub Bot commented on FLINK-3547:
---------------------------------------

Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/1820#issuecomment-200942437

    I've rebased on master and will merge once travis turns green :)

> Add support for streaming projection, selection, and union
> ----------------------------------------------------------
>
> Key: FLINK-3547
> URL: https://issues.apache.org/jira/browse/FLINK-3547
> Project: Flink
> Issue Type: Sub-task
> Components: Table API
> Reporter: Vasia Kalavri
> Assignee: Vasia Kalavri
[GitHub] flink pull request: [FLINK-3547] add support for streaming filter,...
Github user vasia commented on the pull request:

    https://github.com/apache/flink/pull/1820#issuecomment-200942437

    I've rebased on master and will merge once travis turns green :)
[jira] [Commented] (FLINK-3667) Generalize client<->cluster communication
[ https://issues.apache.org/jira/browse/FLINK-3667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210591#comment-15210591 ]

Till Rohrmann commented on FLINK-3667:
--------------------------------------

Sounds like a good idea. +1 :-)

> Generalize client<->cluster communication
> -----------------------------------------
>
> Key: FLINK-3667
> URL: https://issues.apache.org/jira/browse/FLINK-3667
> Project: Flink
> Issue Type: Improvement
> Components: YARN Client
> Reporter: Maximilian Michels
> Assignee: Maximilian Michels
>
> Here are some notes I took when inspecting the client<->cluster classes with
> regard to future integration of other resource management frameworks in
> addition to Yarn (e.g. Mesos).
> {noformat}
> 1 Cluster Client Abstraction
> ============================
>
> 1.1 Status Quo
> --------------
>
> 1.1.1 FlinkYarnClient
>   • Holds the cluster configuration (Flink-specific and Yarn-specific)
>   • Contains the deploy() method to deploy the cluster
>   • Creates the Hadoop Yarn client
>   • Receives the initial job manager address
>   • Bootstraps the FlinkYarnCluster
>
> 1.1.2 FlinkYarnCluster
>   • Wrapper around the Hadoop Yarn client
>   • Queries cluster for status updates
>   • Lifetime methods to start and shut down the cluster
>   • Flink-specific features like shutdown after job completion
>
> 1.1.3 ApplicationClient
>   • Acts as a middle-man for asynchronous cluster communication
>   • Designed to communicate with Yarn, not used in Standalone mode
>
> 1.1.4 CliFrontend
>   • Deeply integrated with FlinkYarnClient and FlinkYarnCluster
>   • Constantly distinguishes between Yarn and Standalone mode
>   • Would be nice to have a general abstraction in place
>
> 1.1.5 Client
>   • Job submission and job-related actions, agnostic of resource framework
>
> 1.2 Proposal
> ------------
>
> 1.2.1 ClusterConfig (before: AbstractFlinkYarnClient)
>   • Extensible cluster-agnostic config
>   • May be extended by a specific cluster, e.g. YarnClusterConfig
>
> 1.2.2 ClusterClient (before: AbstractFlinkYarnClient)
>   • Deals with cluster (RM) specific communication
>   • Exposes framework-agnostic information
>   • YarnClusterClient, MesosClusterClient, StandaloneClusterClient
>
> 1.2.3 FlinkCluster (before: AbstractFlinkYarnCluster)
>   • Basic interface to communicate with a running cluster
>   • Receives the ClusterClient for cluster-specific communication
>   • Should not have to care about the specific implementations of the client
>
> 1.2.4 ApplicationClient
>   • Can be changed to work cluster-agnostic (first steps already in FLINK-3543)
>
> 1.2.5 CliFrontend
>   • Never has to differentiate between cluster types after it has determined
>     which cluster class to load
>   • Base class handles framework-agnostic command line arguments
>   • Pluggables for Yarn and Mesos handle specific commands
> {noformat}
> I would like to create/refactor the affected classes to set us up for a more
> flexible client-side resource management abstraction.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
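The proposed seams from the notes above can be sketched as plain interfaces. This is only an illustration of the shape, not the eventual Flink code: the method names (`deploy`, `jobManagerAddress`, `shutdown`) and the config layout are assumptions, and only the class names (`ClusterConfig`, `ClusterClient`, `FlinkCluster`, `StandaloneClusterClient`) come from the proposal itself:

```java
import java.util.HashMap;
import java.util.Map;

public class ClusterAbstractionSketch {
    // 1.2.1: cluster-agnostic configuration; frameworks would extend it
    // (e.g. a YarnClusterConfig adding queue/container settings).
    static class ClusterConfig {
        final Map<String, String> options = new HashMap<>();
    }

    // 1.2.2: deals with resource-manager-specific communication but exposes
    // only framework-agnostic operations to callers.
    interface ClusterClient {
        FlinkCluster deploy(ClusterConfig config);
    }

    // 1.2.3: basic handle to a running cluster; CliFrontend would talk to
    // this without ever distinguishing Yarn from Standalone.
    interface FlinkCluster {
        String jobManagerAddress();
        void shutdown();
    }

    // Trivial standalone implementation to show the seams fit together.
    static class StandaloneClusterClient implements ClusterClient {
        public FlinkCluster deploy(ClusterConfig config) {
            String address = config.options.getOrDefault("jobmanager", "localhost:6123");
            return new FlinkCluster() {
                public String jobManagerAddress() { return address; }
                public void shutdown() { /* nothing to tear down here */ }
            };
        }
    }

    public static void main(String[] args) {
        // Caller code is identical whichever ClusterClient is plugged in.
        FlinkCluster cluster = new StandaloneClusterClient().deploy(new ClusterConfig());
        System.out.println("jobmanager at " + cluster.jobManagerAddress());
        cluster.shutdown();
    }
}
```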
[jira] [Commented] (FLINK-3659) Allow ConnectedStreams to Be Keyed on Only One Side
[ https://issues.apache.org/jira/browse/FLINK-3659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210586#comment-15210586 ]

ASF GitHub Bot commented on FLINK-3659:
---------------------------------------

Github user aljoscha commented on the pull request:

    https://github.com/apache/flink/pull/1831#issuecomment-200933436

    To elaborate on this: state right now works well if you stick to the (admittedly
    somewhat hidden) rules. That is, you should only access state if there is a key
    available. If there is no key available, the behavior changes in unexpected ways
    based on which state backend is used and the capabilities of the key serializer.

    For example, let's look at access to `ValueState` in `open()`. For mem/fs state:
    `ValueState.value()` works, it will return the default value; `ValueState.update()`
    will throw an NPE. For RocksDB state: neither method works if the key serializer
    cannot handle null values; if it can, then both methods will change state for the
    `null` key.

    For these reasons I would like to change the semantics of state such that the user
    always has to call `getState` (or a similar method) and that the returned accessor
    object is documented to only be valid for the duration of the processing method.
    Right now, the user can wreak all kinds of havoc by down-casting the returned State
    object.

    Right now we have a very simple system that works if the user keeps to the rules
    and also makes things go fast. If we want to make it more restrictive we will lose
    some performance, of course.
[GitHub] flink pull request: [FLINK-3659] Allow ConnectedStreams to Be Keye...
Github user aljoscha commented on the pull request:

    https://github.com/apache/flink/pull/1831#issuecomment-200933436

    To elaborate on this: state right now works well if you stick to the (admittedly
    somewhat hidden) rules. That is, you should only access state if there is a key
    available. If there is no key available, the behavior changes in unexpected ways
    based on which state backend is used and the capabilities of the key serializer.

    For example, let's look at access to `ValueState` in `open()`. For mem/fs state:
    `ValueState.value()` works, it will return the default value; `ValueState.update()`
    will throw an NPE. For RocksDB state: neither method works if the key serializer
    cannot handle null values; if it can, then both methods will change state for the
    `null` key.

    For these reasons I would like to change the semantics of state such that the user
    always has to call `getState` (or a similar method) and that the returned accessor
    object is documented to only be valid for the duration of the processing method.
    Right now, the user can wreak all kinds of havoc by down-casting the returned State
    object.

    Right now we have a very simple system that works if the user keeps to the rules
    and also makes things go fast. If we want to make it more restrictive we will lose
    some performance, of course.
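The "accessor only valid for the duration of the processing method" semantics proposed here can be modeled without Flink. The sketch below is a hypothetical toy backend (none of these class names exist in Flink): an accessor captures the key that was current when it was handed out and fails loudly once the processing call is over, instead of silently reading or writing the wrong key's state:

```java
import java.util.HashMap;
import java.util.Map;

public class StateAccessorSketch {
    // Hypothetical per-key backend. Accessors handed to user code become
    // invalid as soon as the simulated processing call returns.
    static class Backend {
        private final Map<String, Integer> store = new HashMap<>();
        private String currentKey; // null outside any processing call

        class Accessor {
            private final String key = currentKey; // key at hand-out time
            int value() { check(); return store.getOrDefault(key, 0); }
            void update(int v) { check(); store.put(key, v); }
            private void check() {
                // Stale if processing ended or a different key is current.
                if (!key.equals(currentKey)) {
                    throw new IllegalStateException("accessor used outside its processing call");
                }
            }
        }

        // Simulates the start of one processElement() invocation for a key.
        Accessor processWith(String key) {
            currentKey = key;
            return new Accessor();
        }

        void endProcessing() { currentKey = null; }
    }

    public static void main(String[] args) {
        Backend backend = new Backend();
        Backend.Accessor state = backend.processWith("user-1");
        state.update(state.value() + 1);   // fine: we are inside the call
        backend.endProcessing();
        try {
            state.value();                  // stale accessor: must fail
        } catch (IllegalStateException e) {
            System.out.println("rejected stale access: " + e.getMessage());
        }
    }
}
```

This trades a null-key NPE or silent `null`-key mutation (the backend-dependent behavior described above) for one well-defined failure mode.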
[jira] [Resolved] (FLINK-3639) Add methods and utilities to register DataSets and Tables in the TableEnvironment
[ https://issues.apache.org/jira/browse/FLINK-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vasia Kalavri resolved FLINK-3639.
----------------------------------
Resolution: Implemented
Fix Version/s: 1.1.0

> Add methods and utilities to register DataSets and Tables in the
> TableEnvironment
> ----------------------------------------------------------------
>
> Key: FLINK-3639
> URL: https://issues.apache.org/jira/browse/FLINK-3639
> Project: Flink
> Issue Type: New Feature
> Components: Table API
> Affects Versions: 1.1.0
> Reporter: Vasia Kalavri
> Assignee: Vasia Kalavri
> Fix For: 1.1.0
>
> In order to make tables queryable from SQL we need to register them under a
> unique name in the TableEnvironment.
> [This design document|https://docs.google.com/document/d/1sITIShmJMGegzAjGqFuwiN_iw1urwykKsLiacokxSw0/edit]
> describes the proposed API.
[jira] [Commented] (FLINK-3639) Add methods and utilities to register DataSets and Tables in the TableEnvironment
[ https://issues.apache.org/jira/browse/FLINK-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210528#comment-15210528 ]

ASF GitHub Bot commented on FLINK-3639:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/1827
[GitHub] flink pull request: [FLINK-3639] add methods for registering datas...
Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/1827
[jira] [Commented] (FLINK-3257) Add Exactly-Once Processing Guarantees in Iterative DataStream Jobs
[ https://issues.apache.org/jira/browse/FLINK-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210494#comment-15210494 ]

ASF GitHub Bot commented on FLINK-3257:
---------------------------------------

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1668#issuecomment-200909639

    The core idea of this is very good, and the illustration is very nice.

    After an offline chat with @senorcarbone, we concluded that a remaining problem is
    currently the way this integrates with the timeout-based termination detection,
    which brings us to the point that we should (in my opinion) change the way that
    loops terminate. Termination should probably be based on end-of-stream events, to
    make it deterministic and not susceptible to delays.

    The question now is: does it make sense to do the termination change first and base
    this on top of it, or to merge this irrespective of that?

> Add Exactly-Once Processing Guarantees in Iterative DataStream Jobs
> -------------------------------------------------------------------
>
> Key: FLINK-3257
> URL: https://issues.apache.org/jira/browse/FLINK-3257
> Project: Flink
> Issue Type: Improvement
> Reporter: Paris Carbone
> Assignee: Paris Carbone
>
> The current snapshotting algorithm cannot support cycles in the execution
> graph. An alternative scheme can potentially include records in-transit
> through the back-edges of a cyclic execution graph (ABS [1]) to achieve the
> same guarantees.
> One straightforward implementation of ABS for cyclic graphs can work along
> the following lines:
> 1) Upon triggering a barrier in an IterationHead from the TaskManager, start
> blocking output and start upstream backup of all records forwarded from the
> respective IterationSink.
> 2) The IterationSink should eventually forward the current snapshotting epoch
> barrier to the IterationSource.
> 3) Upon receiving a barrier from the IterationSink, the IterationSource
> should finalize the snapshot, unblock its output, emit all records
> in-transit in FIFO order, and continue the usual execution.
>
> Upon restart, the IterationSource should emit all records from the injected
> snapshot first and then continue its usual execution.
> Several optimisations and slight variations can potentially be achieved, but
> this can be the initial implementation take.
> [1] http://arxiv.org/abs/1506.08603
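The three steps of the issue description can be simulated with plain queues. This toy model (the `IterationHead` class and its method names are illustrative, not Flink's actual classes) shows the key property of ABS on a cycle: records that arrive over the back-edge while the barrier is circulating are logged into the snapshot instead of being dropped or double-processed, then replayed in FIFO order once the barrier returns:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

public class LoopSnapshotSketch {
    static class IterationHead {
        final ArrayDeque<String> backup = new ArrayDeque<>();
        final List<String> emitted = new ArrayList<>();
        boolean snapshotting = false;

        // Step 1: barrier triggered -> block output, start backing up
        // records forwarded from the iteration sink.
        void onBarrier() { snapshotting = true; }

        void onBackEdgeRecord(String record) {
            if (snapshotting) backup.add(record);  // in-transit: log it
            else emitted.add(record);              // normal flow
        }

        // Steps 2-3: the sink forwarded the barrier back to us ->
        // finalize the snapshot, unblock, replay the log in FIFO order.
        List<String> onBarrierReturned() {
            List<String> snapshot = new ArrayList<>(backup);
            snapshotting = false;
            while (!backup.isEmpty()) emitted.add(backup.poll());
            return snapshot;  // these records are part of the checkpoint
        }
    }

    public static void main(String[] args) {
        IterationHead head = new IterationHead();
        head.onBackEdgeRecord("a");  // before the barrier: flows through
        head.onBarrier();
        head.onBackEdgeRecord("b");  // in transit during the snapshot: logged
        head.onBackEdgeRecord("c");
        List<String> snapshot = head.onBarrierReturned();
        System.out.println("snapshot=" + snapshot + " emitted=" + head.emitted);
    }
}
```

On restore, replaying the snapshotted records first (before resuming normal input) is exactly the "emit all records from the injected snapshot first" rule from the description.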
[GitHub] flink pull request: [FLINK-3257] Add Exactly-Once Processing Guara...
Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1668#issuecomment-200909639

    The core idea of this is very good, and the illustration is very nice.

    After an offline chat with @senorcarbone, we concluded that a remaining problem is
    currently the way this integrates with the timeout-based termination detection,
    which brings us to the point that we should (in my opinion) change the way that
    loops terminate. Termination should probably be based on end-of-stream events, to
    make it deterministic and not susceptible to delays.

    The question now is: does it make sense to do the termination change first and base
    this on top of it, or to merge this irrespective of that?
[GitHub] flink pull request: [FLINK-3659] Allow ConnectedStreams to Be Keye...
Github user gyfora commented on the pull request:

    https://github.com/apache/flink/pull/1831#issuecomment-200907475

    @StephanEwen This PR does not change the behaviour of any existing Flink
    applications. It now allows, though, that users specify a key for only one input of
    a CoMapFunction, for instance:

    `input1.keyBy(...).connect(input2.broadcast())`

    This was previously impossible, which probably blocks many existing use cases (it
    is actually blocking one of my own applications that I try to build on Flink) where
    one input does not have associated state. After this change, the key-value state
    defined by the keyed input stream works as expected.

    The only not-so-fortunate behaviour is that users can still call state.value() for
    inputs from the non-keyed stream, and the behaviour is not clearly defined. If
    there was already input from the other side, it returns the state for the last key;
    otherwise it will probably throw a NullPointerException. I think this is acceptable
    behaviour for the time being because well-written programs will work as expected.

    We can think about how we want to handle the other, non-keyed input, but that will
    probably include changing many things in the KvBackends so they can do this
    properly. This problem already exists in Flink, as state access outside of the
    processing method is not well defined.
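The pattern being unlocked here looks roughly as follows in user code. This is a non-runnable fragment against the Flink 1.x DataStream API (it needs flink-streaming-java on the classpath); the `Event`/`Rule` types, field names, and function bodies are hypothetical placeholders, while `keyBy`, `connect`, `broadcast`, and `CoFlatMapFunction` are the real API pieces under discussion:

```
DataStream<Event> events = ...;        // hypothetical high-volume stream
DataStream<Rule> ruleUpdates = ...;    // hypothetical low-volume update stream

events
    .keyBy(event -> event.userId)      // only this side is keyed
    .connect(ruleUpdates.broadcast())  // updates reach every parallel subtask
    .flatMap(new CoFlatMapFunction<Event, Rule, String>() {
        // Keyed state is well defined in flatMap1, which sees the keyed input.
        public void flatMap1(Event e, Collector<String> out) { /* use keyed state */ }
        // Touching keyed state here, on the broadcast side, is exactly the
        // "not clearly defined" case the comment above warns about.
        public void flatMap2(Rule r, Collector<String> out) { /* apply update */ }
    });
```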
[jira] [Commented] (FLINK-3659) Allow ConnectedStreams to Be Keyed on Only One Side
[ https://issues.apache.org/jira/browse/FLINK-3659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210480#comment-15210480 ]

ASF GitHub Bot commented on FLINK-3659:
---------------------------------------

Github user gyfora commented on the pull request:

    https://github.com/apache/flink/pull/1831#issuecomment-200907475

    @StephanEwen This PR does not change the behaviour of any existing Flink
    applications. It now allows, though, that users specify a key for only one input of
    a CoMapFunction, for instance:

    `input1.keyBy(...).connect(input2.broadcast())`

    This was previously impossible, which probably blocks many existing use cases (it
    is actually blocking one of my own applications that I try to build on Flink) where
    one input does not have associated state. After this change, the key-value state
    defined by the keyed input stream works as expected.

    The only not-so-fortunate behaviour is that users can still call state.value() for
    inputs from the non-keyed stream, and the behaviour is not clearly defined. If
    there was already input from the other side, it returns the state for the last key;
    otherwise it will probably throw a NullPointerException. I think this is acceptable
    behaviour for the time being because well-written programs will work as expected.

    We can think about how we want to handle the other, non-keyed input, but that will
    probably include changing many things in the KvBackends so they can do this
    properly. This problem already exists in Flink, as state access outside of the
    processing method is not well defined.
[jira] [Commented] (FLINK-3659) Allow ConnectedStreams to Be Keyed on Only One Side
[ https://issues.apache.org/jira/browse/FLINK-3659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210412#comment-15210412 ]

ASF GitHub Bot commented on FLINK-3659:
---------------------------------------

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1831#issuecomment-200887743

    Can someone elaborate on the semantics? I am against merging something that changes
    semantics and has zero description.
[jira] [Commented] (FLINK-3664) Create a method to easily Summarize a DataSet
[ https://issues.apache.org/jira/browse/FLINK-3664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210409#comment-15210409 ]

Todd Lisonbee commented on FLINK-3664:
--------------------------------------

Hi Fabian, thanks for the feedback. Your first 3 comments all make sense - agreed.

On distinct counts, I thought about it but wasn't sure, so I left it out for now.

For an approximation, the best idea I had was to choose some arbitrary number, maybe
100, and then report the exact number of distinct values if there are fewer than 100,
or say "100+" if there are more. This would be nice for categorical variables that
happen to have fewer than 100 different values. But with enough rows and columns it
could be expensive (even if Tuple is currently limited to 22), or at least relatively
more expensive than the other calculations. There isn't a perfect magic number, so I
didn't like this idea all the way.

Do you know of a nice way to approximate distinct counts? Thanks.

> Create a method to easily Summarize a DataSet
> ---------------------------------------------
>
> Key: FLINK-3664
> URL: https://issues.apache.org/jira/browse/FLINK-3664
> Project: Flink
> Issue Type: Improvement
> Reporter: Todd Lisonbee
> Attachments: DataSet-Summary-Design-March2016-v1.txt
>
> Here is an example:
> {code}
> /**
>  * Summarize a DataSet of Tuples by collecting single pass statistics for all
>  * columns
>  */
> public Tuple summarize()
>
> Dataset> input = // [...]
> Tuple3 summary = input.summarize()
>
> summary.getField(0).stddev()
> summary.getField(1).maxStringLength()
> {code}
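The capped-count idea described in the comment (exact below some threshold, "100+" above it) can be sketched in a few lines. The `DistinctCounter` class and its API are illustrative, not part of Flink; for a true approximation at unbounded cardinality, a probabilistic sketch such as HyperLogLog would be the usual answer, but this is the simpler variant the comment proposes:

```java
import java.util.HashSet;
import java.util.Set;

public class CappedDistinctCount {
    // Tracks distinct values exactly until a cap is reached, then reports
    // only "cap+" -- bounding per-column memory as the comment suggests.
    static class DistinctCounter {
        private final int cap;
        private final Set<String> seen = new HashSet<>();
        private boolean overflowed = false;

        DistinctCounter(int cap) { this.cap = cap; }

        void add(String value) {
            if (overflowed) return;  // stop paying for memory past the cap
            seen.add(value);
            if (seen.size() > cap) {
                overflowed = true;
                seen.clear();        // release the set once it is useless
            }
        }

        String result() { return overflowed ? cap + "+" : String.valueOf(seen.size()); }
    }

    public static void main(String[] args) {
        // Categorical column: stays under the cap, count is exact.
        DistinctCounter categorical = new DistinctCounter(100);
        for (String v : new String[]{"red", "green", "red", "blue"}) categorical.add(v);

        // High-cardinality column: overflows, count is capped.
        DistinctCounter highCardinality = new DistinctCounter(100);
        for (int i = 0; i < 1000; i++) highCardinality.add("id-" + i);

        System.out.println(categorical.result());
        System.out.println(highCardinality.result());
    }
}
```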
[GitHub] flink pull request: [FLINK-3659] Allow ConnectedStreams to Be Keye...
Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1831#issuecomment-200887743

    Can someone elaborate on the semantics? I am against merging something that changes
    semantics and has zero description.
[jira] [Commented] (FLINK-3659) Allow ConnectedStreams to Be Keyed on Only One Side
[ https://issues.apache.org/jira/browse/FLINK-3659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210406#comment-15210406 ]

ASF GitHub Bot commented on FLINK-3659:
---------------------------------------

Github user gyfora commented on the pull request:

    https://github.com/apache/flink/pull/1831#issuecomment-200886203

    I think you can go ahead merging this if no-one has any objections :)
[GitHub] flink pull request: [FLINK-3659] Allow ConnectedStreams to Be Keye...
Github user gyfora commented on the pull request: https://github.com/apache/flink/pull/1831#issuecomment-200886203 I think you can go ahead merging this if no-one has any objections :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-3659] Allow ConnectedStreams to Be Keye...
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/1831#issuecomment-200888107 For example, how does keyed state work for the input side that is not key partitioned? How is the key found? How is partitioning guaranteed? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-3659) Allow ConnectedStreams to Be Keyed on Only One Side
[ https://issues.apache.org/jira/browse/FLINK-3659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210413#comment-15210413 ] ASF GitHub Bot commented on FLINK-3659: --- Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/1831#issuecomment-200888107 For example, how does keyed state work for the input side that is not key partitioned? How is the key found? How is partitioning guaranteed? > Allow ConnectedStreams to Be Keyed on Only One Side > --- > > Key: FLINK-3659 > URL: https://issues.apache.org/jira/browse/FLINK-3659 > Project: Flink > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0 >Reporter: Aljoscha Krettek >Assignee: Aljoscha Krettek > > Currently, we only allow partitioned state when both inputs are keyed (with > the same key type). I think a valid use case is to have only one side keyed > and have the other side broadcast to publish some updates that are relevant > for all keys. > When relaxing the requirement to have only one side keyed we must still > ensure that the same key is used if both sides are keyed. > [~gyfora] Did you try this with what you're working on? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-3257] Add Exactly-Once Processing Guara...
Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/1668#discussion_r57327419 --- Diff: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamIterationHead.java --- @@ -17,43 +17,80 @@ package org.apache.flink.streaming.runtime.tasks; -import java.util.concurrent.ArrayBlockingQueue; -import java.util.concurrent.BlockingQueue; -import java.util.concurrent.TimeUnit; - import org.apache.flink.annotation.Internal; import org.apache.flink.api.common.JobID; +import org.apache.flink.core.memory.DataInputViewStreamWrapper; +import org.apache.flink.core.memory.DataOutputViewStreamWrapper; +import org.apache.flink.runtime.io.network.api.CheckpointBarrier; +import org.apache.flink.runtime.state.AbstractStateBackend; +import org.apache.flink.runtime.state.StreamStateHandle; +import org.apache.flink.streaming.api.operators.AbstractStreamOperator; +import org.apache.flink.streaming.api.operators.OneInputStreamOperator; import org.apache.flink.streaming.api.watermark.Watermark; -import org.apache.flink.streaming.runtime.io.RecordWriterOutput; import org.apache.flink.streaming.runtime.io.BlockingQueueBroker; +import org.apache.flink.streaming.runtime.io.RecordWriterOutput; import org.apache.flink.streaming.runtime.streamrecord.StreamRecord; +import org.apache.flink.streaming.runtime.streamrecord.StreamRecordSerializer; +import org.apache.flink.types.Either; import org.slf4j.Logger; import org.slf4j.LoggerFactory; +import java.io.InputStream; +import java.util.ArrayList; +import java.util.List; +import java.util.concurrent.ArrayBlockingQueue; +import java.util.concurrent.BlockingQueue; +import java.util.concurrent.TimeUnit; + @Internal -public class StreamIterationHead extends OneInputStreamTask{ +public class StreamIterationHead extends OneInputStreamTask { private static final Logger LOG = LoggerFactory.getLogger(StreamIterationHead.class); private volatile boolean running = true; - // - + /** +* A flag that is 
on during the duration of a checkpoint. While onSnapshot is true the iteration head has to perform +* upstream backup of all records in transit within the loop. +*/ + private volatile boolean onSnapshot = false; --- End diff -- Always accessed in lock scope, no need for `volatile` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-3613) Add standard deviation, mean, variance to list of Aggregations
[ https://issues.apache.org/jira/browse/FLINK-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210344#comment-15210344 ] Fabian Hueske commented on FLINK-3613: -- Hi [~tlisonbee], thanks for the detailed design doc! I'll give some feedback soon. > Add standard deviation, mean, variance to list of Aggregations > -- > > Key: FLINK-3613 > URL: https://issues.apache.org/jira/browse/FLINK-3613 > Project: Flink > Issue Type: Improvement >Reporter: Todd Lisonbee >Priority: Minor > Attachments: DataSet-Aggregation-Design-March2016-v1.txt > > > Implement standard deviation, mean, variance for > org.apache.flink.api.java.aggregation.Aggregations > Ideally implementation should be single pass and numerically stable. > References: > "Scalable and Numerically Stable Descriptive Statistics in SystemML", Tian et > al, International Conference on Data Engineering 2012 > http://dl.acm.org/citation.cfm?id=2310392 > "The Kahan summation algorithm (also known as compensated summation) reduces > the numerical errors that occur when adding a sequence of finite precision > floating point numbers. Numerical errors arise due to truncation and > rounding. These errors can lead to numerical instability when calculating > variance." > https://en.wikipedia.org/wiki/Kahan_summation_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
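The issue description references Kahan (compensated) summation as a way to keep single-pass aggregates numerically stable. A minimal sketch of the technique, independent of any Flink API:

```java
/**
 * Minimal Kahan (compensated) summation, as referenced in the issue:
 * a running compensation term recovers the low-order bits that plain
 * floating-point summation would lose. Illustrative only, not Flink code.
 */
public class KahanSum {
    private double sum = 0.0;
    private double c = 0.0; // running compensation for lost low-order bits

    public void add(double value) {
        double y = value - c; // re-apply the previously lost bits
        double t = sum + y;   // low-order bits of y may be lost here...
        c = (t - sum) - y;    // ...and are captured back into c
        sum = t;
    }

    public double value() {
        return sum;
    }
}
```

Summing 0.1 a million times with this accumulator stays within the representation error of 0.1 itself, whereas naive summation accumulates an error several orders of magnitude larger. (Note Kahan summation still loses accuracy when an added value is much larger than the running sum; Neumaier's variant handles that case.)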
[jira] [Commented] (FLINK-3664) Create a method to easily Summarize a DataSet
[ https://issues.apache.org/jira/browse/FLINK-3664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210343#comment-15210343 ] Fabian Hueske commented on FLINK-3664: -- Hi [~tlisonbee], I like this proposal a lot. It would be a great feature to add. A few comments on your proposal: - If possible use a {{CombineFunction}} instead of a {{GroupCombineFunction}}. {{CombineFunction}} can be executed using a hash-based execution strategy whereas {{GroupCombineFunction}} must be evaluated with a sort-based strategy. Hash-based combiners are not supported yet but will hopefully be added soon (see PR #1517). - We add new functionality such as {{summarize()}} to {{DataSetUtils}} and not directly to {{DataSet}}. {{DataSetUtils}} serves as a kind of staging area for new features where we are not sure yet whether we want to add them to the core API. - It is certainly possible to implement the feature as {{CustomUnaryOperation}} but it is also fine to directly call the API methods. - Do you think it makes sense to include (approximate) distinct counts for integer and String values, or should this be a separate method? > Create a method to easily Summarize a DataSet > - > > Key: FLINK-3664 > URL: https://issues.apache.org/jira/browse/FLINK-3664 > Project: Flink > Issue Type: Improvement >Reporter: Todd Lisonbee > Attachments: DataSet-Summary-Design-March2016-v1.txt > > > Here is an example: > {code} > /** > * Summarize a DataSet of Tuples by collecting single pass statistics for all > columns > */ > public Tuple summarize() > Dataset> input = // [...] > Tuple3 summary > = input.summarize() > summary.getField(0).stddev() > summary.getField(1).maxStringLength() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3257) Add Exactly-Once Processing Guarantees in Iterative DataStream Jobs
[ https://issues.apache.org/jira/browse/FLINK-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210328#comment-15210328 ] ASF GitHub Bot commented on FLINK-3257: --- Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/1668#discussion_r57327419 --- Diff: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamIterationHead.java --- @@ -17,43 +17,80 @@ package org.apache.flink.streaming.runtime.tasks; -import java.util.concurrent.ArrayBlockingQueue; -import java.util.concurrent.BlockingQueue; -import java.util.concurrent.TimeUnit; - import org.apache.flink.annotation.Internal; import org.apache.flink.api.common.JobID; +import org.apache.flink.core.memory.DataInputViewStreamWrapper; +import org.apache.flink.core.memory.DataOutputViewStreamWrapper; +import org.apache.flink.runtime.io.network.api.CheckpointBarrier; +import org.apache.flink.runtime.state.AbstractStateBackend; +import org.apache.flink.runtime.state.StreamStateHandle; +import org.apache.flink.streaming.api.operators.AbstractStreamOperator; +import org.apache.flink.streaming.api.operators.OneInputStreamOperator; import org.apache.flink.streaming.api.watermark.Watermark; -import org.apache.flink.streaming.runtime.io.RecordWriterOutput; import org.apache.flink.streaming.runtime.io.BlockingQueueBroker; +import org.apache.flink.streaming.runtime.io.RecordWriterOutput; import org.apache.flink.streaming.runtime.streamrecord.StreamRecord; +import org.apache.flink.streaming.runtime.streamrecord.StreamRecordSerializer; +import org.apache.flink.types.Either; import org.slf4j.Logger; import org.slf4j.LoggerFactory; +import java.io.InputStream; +import java.util.ArrayList; +import java.util.List; +import java.util.concurrent.ArrayBlockingQueue; +import java.util.concurrent.BlockingQueue; +import java.util.concurrent.TimeUnit; + @Internal -public class StreamIterationHead extends OneInputStreamTask{ +public class 
StreamIterationHead extends OneInputStreamTask { private static final Logger LOG = LoggerFactory.getLogger(StreamIterationHead.class); private volatile boolean running = true; - // - + /** +* A flag that is on during the duration of a checkpoint. While onSnapshot is true the iteration head has to perform +* upstream backup of all records in transit within the loop. +*/ + private volatile boolean onSnapshot = false; --- End diff -- Always accessed in lock scope, no need for `volatile` > Add Exactly-Once Processing Guarantees in Iterative DataStream Jobs > --- > > Key: FLINK-3257 > URL: https://issues.apache.org/jira/browse/FLINK-3257 > Project: Flink > Issue Type: Improvement >Reporter: Paris Carbone >Assignee: Paris Carbone > > The current snapshotting algorithm cannot support cycles in the execution > graph. An alternative scheme can potentially include records in-transit > through the back-edges of a cyclic execution graph (ABS [1]) to achieve the > same guarantees. > One straightforward implementation of ABS for cyclic graphs can work as > follows along the lines: > 1) Upon triggering a barrier in an IterationHead from the TaskManager start > block output and start upstream backup of all records forwarded from the > respective IterationSink. > 2) The IterationSink should eventually forward the current snapshotting epoch > barrier to the IterationSource. > 3) Upon receiving a barrier from the IterationSink, the IterationSource > should finalize the snapshot, unblock its output and emit all records > in-transit in FIFO order and continue the usual execution. > -- > Upon restart the IterationSource should emit all records from the injected > snapshot first and then continue its usual execution. > Several optimisations and slight variations can be potentially achieved but > this can be the initial implementation take. > [1] http://arxiv.org/abs/1506.08603 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
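The three ABS steps listed in the issue (block output and back up back-edge records on barrier, forward the barrier through the loop, then unblock and replay the backed-up records in FIFO order) can be modeled with a small toy class. This is a simplified standalone model, not the actual Flink implementation; all names are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

/**
 * Toy model of the upstream-backup scheme in steps 1-3 above: while a
 * snapshot is in progress, the iteration head logs records arriving on
 * the back-edge instead of forwarding them; once the barrier returns
 * from the iteration sink, the log is replayed in FIFO order.
 */
public class IterationHeadModel {
    private final Queue<String> backup = new ArrayDeque<>();
    private final List<String> forwarded = new ArrayList<>();
    private boolean snapshotting = false;

    /** Step 1: barrier triggered - block output, start upstream backup. */
    public void onCheckpointBarrier() {
        snapshotting = true;
    }

    /** A record arriving on the loop's back-edge. */
    public void onBackEdgeRecord(String record) {
        if (snapshotting) {
            backup.add(record); // part of the snapshot's in-transit set
        } else {
            forwarded.add(record);
        }
    }

    /** Step 3: barrier came back from the sink - finalize, unblock, replay. */
    public void onBarrierFromIterationSink() {
        snapshotting = false;
        while (!backup.isEmpty()) {
            forwarded.add(backup.poll()); // FIFO replay preserves order
        }
    }

    public List<String> forwardedRecords() {
        return forwarded;
    }
}
```

The same FIFO replay is what the restart path in the issue describes: emit the snapshotted in-transit records first, then continue normal execution.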
[jira] [Created] (FLINK-3667) Generalize client<->cluster communication
Maximilian Michels created FLINK-3667: - Summary: Generalize client<->cluster communication Key: FLINK-3667 URL: https://issues.apache.org/jira/browse/FLINK-3667 Project: Flink Issue Type: Improvement Components: YARN Client Reporter: Maximilian Michels Assignee: Maximilian Michels Here are some notes I took when inspecting the client<->cluster classes with regard to future integration of other resource management frameworks in addition to Yarn (e.g. Mesos). {noformat} 1 Cluster Client Abstraction 1.1 Status Quo ── 1.1.1 FlinkYarnClient ╌ • Holds the cluster configuration (Flink-specific and Yarn-specific) • Contains the deploy() method to deploy the cluster • Creates the Hadoop Yarn client • Receives the initial job manager address • Bootstraps the FlinkYarnCluster 1.1.2 FlinkYarnCluster ╌╌ • Wrapper around the Hadoop Yarn client • Queries cluster for status updates • Life time methods to start and shutdown the cluster • Flink specific features like shutdown after job completion 1.1.3 ApplicationClient ╌╌╌ • Acts as a middle-man for asynchronous cluster communication • Designed to communicate with Yarn, not used in Standalone mode 1.1.4 CliFrontend ╌ • Deeply integrated with FlinkYarnClient and FlinkYarnCluster • Constantly distinguishes between Yarn and Standalone mode • Would be nice to have a general abstraction in place 1.1.5 Client • Job submission and Job related actions, agnostic of resource framework 1.2 Proposal 1.2.1 ClusterConfig (before: AbstractFlinkYarnClient) ╌ • Extensible cluster-agnostic config • May be extended by specific cluster, e.g. 
YarnClusterConfig 1.2.2 ClusterClient (before: AbstractFlinkYarnClient) ╌ • Deals with cluster (RM) specific communication • Exposes framework agnostic information • YarnClusterClient, MesosClusterClient, StandaloneClusterClient 1.2.3 FlinkCluster (before: AbstractFlinkYarnCluster) ╌ • Basic interface to communicate with a running cluster • Receives the ClusterClient for cluster-specific communication • Should not have to care about the specific implementations of the client 1.2.4 ApplicationClient ╌╌╌ • Can be changed to work cluster-agnostic (first steps already in FLINK-3543) 1.2.5 CliFrontend ╌ • CliFrontend does never have to differentiate between different cluster types after it has determined which cluster class to load. • Base class handles framework agnostic command line arguments • Pluggables for Yarn, Mesos handle specific commands {noformat} I would like to create/refactor the affected classes to set us up for a more flexible client side resource management abstraction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
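The proposal in sections 1.2.1-1.2.3 above can be sketched as a set of interfaces. This is only a hedged illustration of the proposed shape (ClusterConfig / ClusterClient / FlinkCluster); the method names are invented here and the eventual refactoring may differ.

```java
/**
 * Hedged sketch of the proposed client<->cluster abstraction from the
 * notes above. All method signatures are invented for illustration.
 */
public class ClusterAbstractionSketch {

    /** 1.2.1: extensible, cluster-agnostic configuration. */
    interface ClusterConfig {
        String clusterType();
    }

    /** 1.2.2: cluster-specific communication behind an agnostic surface. */
    interface ClusterClient {
        FlinkCluster deploy(ClusterConfig config);
    }

    /** 1.2.3: handle to a running cluster, unaware of client internals. */
    interface FlinkCluster {
        String jobManagerAddress();
        void shutdown();
    }

    /** A trivial standalone implementation, just to show the plug-in shape
     *  (YarnClusterClient / MesosClusterClient would be siblings). */
    static class StandaloneClusterClient implements ClusterClient {
        @Override
        public FlinkCluster deploy(ClusterConfig config) {
            return new FlinkCluster() {
                @Override
                public String jobManagerAddress() {
                    return "localhost:6123";
                }

                @Override
                public void shutdown() {
                    // nothing to tear down in this sketch
                }
            };
        }
    }
}
```

With this shape, the CliFrontend only decides once which `ClusterClient` to load and never branches on the cluster type afterwards, matching point 1.2.5.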
[jira] [Commented] (FLINK-3633) Job submission silently fails when using user code types
[ https://issues.apache.org/jira/browse/FLINK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210320#comment-15210320 ] Till Rohrmann commented on FLINK-3633: -- Yes that should work. Will try to update the PR. Thanks for the pointer :-) > Job submission silently fails when using user code types > > > Key: FLINK-3633 > URL: https://issues.apache.org/jira/browse/FLINK-3633 > Project: Flink > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann >Priority: Blocker > Fix For: 1.1.0 > > > With the changes introduced by FLINK-3327, it is no longer possible to run > remote Flink jobs which work on user code types. The reason is that now the > {{ExecutionConfig}} is directly stored in the {{JobGraph}} which is sent as > an Akka message to the {{JobManager}}. Per default, user code types are > automatically detected and registered in the {{ExecutionConfig}}. When > deserializing a {{JobGraph}} whose {{ExecutionConfig}} contains user code > classes, the user code class loader is consequently required. However, Akka > does not have access to it and uses the system class loader. This causes Akka > to silently discard the {{SubmitJob}} message, which cannot be deserialized > because of a {{ClassNotFoundException}}. > I propose to not send the {{ExecutionConfig}} explicitly with the > {{JobGraph}} and, thus, to partially revert the changes to before FLINK-3327. > Before, the {{ExecutionConfig}} was serialized into the job configuration and > deserialized on the {{TaskManager}} using the proper user code class loader. > In order to reproduce the problem you can submit the following job to a > remote cluster.
> {code} > public class Job { > public static class CustomType { > private final int value; > public CustomType(int value) { > this.value = value; > } > @Override > public String toString() { > return "CustomType(" + value + ")"; > } > } > public static void main(String[] args) throws Exception { > ExecutionEnvironment env = > ExecutionEnvironment.createRemoteEnvironment(Address, Port, PathToJar); > env.getConfig().disableAutoTypeRegistration(); > DataSet<Integer> input = env.fromElements(1,2,3,4,5); > DataSet<CustomType> customTypes = input.map(new > MapFunction<Integer, CustomType>() { > @Override > public CustomType map(Integer integer) throws Exception > { > return new CustomType(integer); > } > }); > customTypes.print(); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-3639] add methods for registering datas...
Github user vasia commented on the pull request: https://github.com/apache/flink/pull/1827#issuecomment-200864343 Merging! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-3651) Fix faulty RollingSink Restore
[ https://issues.apache.org/jira/browse/FLINK-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210309#comment-15210309 ] ASF GitHub Bot commented on FLINK-3651: --- Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/1830#issuecomment-200863307 Yes, what @rmetzger said is correct. Without the fix and the increase of parallelism it will be a flaky test that still succeeds sometimes. > Fix faulty RollingSink Restore > -- > > Key: FLINK-3651 > URL: https://issues.apache.org/jira/browse/FLINK-3651 > Project: Flink > Issue Type: Bug > Components: Streaming >Reporter: Aljoscha Krettek >Assignee: Aljoscha Krettek > > The RollingSink restore logic has a bug where the sink for subtask index 1 > also removes files for subtask index 11 because the regex that checks for the > file name also matches that one. Adding the suffix to the regex should solve > the problem because then the regex for 1 will only match files for subtask > index 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
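The bug described in the issue (the restore regex for subtask index 1 also matching files of subtask index 11, fixed by including the suffix in the regex) is easy to demonstrate with a made-up file naming scheme. The `part-<subtask>-<count>` names below are invented for illustration; they are not necessarily the RollingSink's actual naming scheme.

```java
import java.util.regex.Pattern;

/**
 * Illustration of the restore bug described above, using a hypothetical
 * "part-<subtask>-<count>" naming scheme: a prefix-only check for subtask
 * index 1 also matches files belonging to subtask index 11. Matching the
 * full name including the counter suffix delimits the subtask index.
 */
public class RollingSinkRegexDemo {

    /** Buggy check: "part-1" is also a prefix of "part-11-...". */
    static boolean looseMatch(int subtask, String fileName) {
        return fileName.startsWith("part-" + subtask);
    }

    /** Fixed check: the suffix in the pattern delimits the subtask index. */
    static boolean strictMatch(int subtask, String fileName) {
        return Pattern.matches("part-" + subtask + "-\\d+", fileName);
    }
}
```

With the loose check, the sink for subtask 1 would wrongly clean up subtask 11's files during restore; the strict pattern only matches its own files.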
[GitHub] flink pull request: [FLINK-3651] Fix faulty RollingSink Restore
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/1830#issuecomment-200863307 Yes, what @rmetzger said is correct. Without the fix and the increase of parallelism it will be a flaky test that still succeeds sometimes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210282#comment-15210282 ] ASF GitHub Bot commented on FLINK-1745: --- Github user danielblazevski commented on the pull request: https://github.com/apache/flink/pull/1220#issuecomment-200854252 @tillrohrmann @chiwanpark done, polished up KNN.scala and some minor changes -- e.g. expanding the description of the parameters in the beginning of KNN.scala. Looking forward to doing the approximate version. I ran some tests last week of the pure Scala z-value KNN and it looks promising (https://github.com/danielblazevski/zknn-scala) > Add exact k-nearest-neighbours algorithm to machine learning library > > > Key: FLINK-1745 > URL: https://issues.apache.org/jira/browse/FLINK-1745 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: Till Rohrmann >Assignee: Daniel Blazevski > Labels: ML, Starter > > Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial > it is still used as a means to classify data and to do regression. This issue > focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as > proposed in [2]. > Could be a starter task. > Resources: > [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm] > [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
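The "z-value" kNN mentioned in the comment above sorts points by their Morton (z-order) key, which roughly preserves spatial locality in one dimension. A minimal 2-D bit-interleaving sketch of that key, as an illustration of the idea only (a real implementation would also shift and scale floating-point coordinates into the integer domain):

```java
/**
 * Minimal 2-D Morton (z-value) encoding, the key ingredient of z-value
 * based approximate kNN: interleaving coordinate bits yields a 1-D sort
 * order that roughly preserves spatial locality. Illustrative sketch.
 */
public class ZValue {

    /** Interleave the lower 16 bits of x and y (x in even bit positions). */
    static long encode(int x, int y) {
        long z = 0;
        for (int i = 0; i < 16; i++) {
            z |= ((long) (x >> i & 1)) << (2 * i);     // x bit -> even slot
            z |= ((long) (y >> i & 1)) << (2 * i + 1); // y bit -> odd slot
        }
        return z;
    }
}
```

Sorting points by this key lets an approximate kNN scan a window of the sorted order instead of comparing against every point.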
[GitHub] flink pull request: [FLINK-1745] Add exact k-nearest-neighbours al...
Github user danielblazevski commented on the pull request: https://github.com/apache/flink/pull/1220#issuecomment-200854252 @tillrohrmann @chiwanpark done, polished up KNN.scala and some minor changes -- e.g. expanding the description of the parameters in the beginning of KNN.scala. Looking forward to doing the approximate version. I ran some tests last week of the pure Scala z-value KNN and it looks promising (https://github.com/danielblazevski/zknn-scala) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-3257) Add Exactly-Once Processing Guarantees in Iterative DataStream Jobs
[ https://issues.apache.org/jira/browse/FLINK-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210279#comment-15210279 ] ASF GitHub Bot commented on FLINK-3257: --- Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/1668#discussion_r57321347 --- Diff: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java --- @@ -450,112 +450,121 @@ else if (operator != null) { } } - @Override - public boolean triggerCheckpoint(final long checkpointId, final long timestamp) throws Exception { - LOG.debug("Starting checkpoint {} on task {}", checkpointId, getName()); - - synchronized (lock) { - if (isRunning) { - - // since both state checkpointing and downstream barrier emission occurs in this - // lock scope, they are an atomic operation regardless of the order in which they occur - // we immediately emit the checkpoint barriers, so the downstream operators can start - // their checkpoint work as soon as possible - operatorChain.broadcastCheckpointBarrier(checkpointId, timestamp); - - // now draw the state snapshot - final StreamOperator[] allOperators = operatorChain.getAllOperators(); - final StreamTaskState[] states = new StreamTaskState[allOperators.length]; + /** +* Checkpoints all operator states of the current StreamTask. +* Thread-safety must be handled outside the scope of this function +*/ + protected boolean checkpointStatesInternal(final long checkpointId, long timestamp) throws Exception { --- End diff -- When rebasing we have to double check that nothing changed in this method when calling from `triggerCheckpoint` etc. > Add Exactly-Once Processing Guarantees in Iterative DataStream Jobs > --- > > Key: FLINK-3257 > URL: https://issues.apache.org/jira/browse/FLINK-3257 > Project: Flink > Issue Type: Improvement >Reporter: Paris Carbone >Assignee: Paris Carbone > > The current snapshotting algorithm cannot support cycles in the execution > graph. 
An alternative scheme can potentially include records in-transit > through the back-edges of a cyclic execution graph (ABS [1]) to achieve the > same guarantees. > One straightforward implementation of ABS for cyclic graphs can work as > follows along the lines: > 1) Upon triggering a barrier in an IterationHead from the TaskManager start > block output and start upstream backup of all records forwarded from the > respective IterationSink. > 2) The IterationSink should eventually forward the current snapshotting epoch > barrier to the IterationSource. > 3) Upon receiving a barrier from the IterationSink, the IterationSource > should finalize the snapshot, unblock its output and emit all records > in-transit in FIFO order and continue the usual execution. > -- > Upon restart the IterationSource should emit all records from the injected > snapshot first and then continue its usual execution. > Several optimisations and slight variations can be potentially achieved but > this can be the initial implementation take. > [1] http://arxiv.org/abs/1506.08603 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-3257] Add Exactly-Once Processing Guara...
Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/1668#discussion_r57321347 --- Diff: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java --- @@ -450,112 +450,121 @@ else if (operator != null) { } } - @Override - public boolean triggerCheckpoint(final long checkpointId, final long timestamp) throws Exception { - LOG.debug("Starting checkpoint {} on task {}", checkpointId, getName()); - - synchronized (lock) { - if (isRunning) { - - // since both state checkpointing and downstream barrier emission occurs in this - // lock scope, they are an atomic operation regardless of the order in which they occur - // we immediately emit the checkpoint barriers, so the downstream operators can start - // their checkpoint work as soon as possible - operatorChain.broadcastCheckpointBarrier(checkpointId, timestamp); - - // now draw the state snapshot - final StreamOperator[] allOperators = operatorChain.getAllOperators(); - final StreamTaskState[] states = new StreamTaskState[allOperators.length]; + /** +* Checkpoints all operator states of the current StreamTask. +* Thread-safety must be handled outside the scope of this function +*/ + protected boolean checkpointStatesInternal(final long checkpointId, long timestamp) throws Exception { --- End diff -- When rebasing we have to double check that nothing changed in this method when calling from `triggerCheckpoint` etc. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Created] (FLINK-3666) Remove Nephele references
Chesnay Schepler created FLINK-3666: --- Summary: Remove Nephele references Key: FLINK-3666 URL: https://issues.apache.org/jira/browse/FLINK-3666 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Chesnay Schepler Priority: Trivial There still exist a few references to nephele which should be removed: {code} flink\docs\setup\local_setup.md: 79 $ tail log/flink-*-jobmanager-*.log 80 INFO ... - Initializing memory manager with 409 megabytes of memory 81: INFO ... - Trying to load org.apache.flinknephele.jobmanager.scheduler.local.LocalScheduler as scheduler 82 INFO ... - Setting up web info server, using web-root directory ... 83: INFO ... - Web info server will display information about nephele job-manager on localhost, port 8081. 84 INFO ... - Starting web info server for JobManager on port 8081 85 ~~~ .. 118 $ cd flink 119 $ bin/start-local.sh 120: Starting Nephele job manager 121 ~~~ {code} {code} flink\flink-runtime\src\main\java\org\apache\flink\runtime\operators\TaskContext.java: 70: AbstractInvokable getOwningNepheleTask(); {code} {code} flink\flink-runtime\src\main\java\org\apache\flink\runtime\operators\BatchTask.java: 1149* @param message The main message for the log. 1150* @param taskName The name of the task. 1151: * @param parent The nephele task that contains the code producing the message. 1152* 1153* @return The string for logging. 1254*/ 1255 @SuppressWarnings("unchecked") 1256: public static Collector initOutputs(AbstractInvokable nepheleTask, ClassLoader cl, TaskConfig config, 1257 ListchainedTasksTarget, 1258 List eventualOutputs, {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-3257] Add Exactly-Once Processing Guara...
Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/1668#discussion_r57320231 --- Diff: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java --- @@ -450,112 +450,121 @@ else if (operator != null) { } } - @Override - public boolean triggerCheckpoint(final long checkpointId, final long timestamp) throws Exception { - LOG.debug("Starting checkpoint {} on task {}", checkpointId, getName()); - - synchronized (lock) { - if (isRunning) { - - // since both state checkpointing and downstream barrier emission occurs in this - // lock scope, they are an atomic operation regardless of the order in which they occur - // we immediately emit the checkpoint barriers, so the downstream operators can start - // their checkpoint work as soon as possible - operatorChain.broadcastCheckpointBarrier(checkpointId, timestamp); - - // now draw the state snapshot - final StreamOperator[] allOperators = operatorChain.getAllOperators(); - final StreamTaskState[] states = new StreamTaskState[allOperators.length]; + /** +* Checkpoints all operator states of the current StreamTask. +* Thread-safety must be handled outside the scope of this function +*/ + protected boolean checkpointStatesInternal(final long checkpointId, long timestamp) throws Exception { --- End diff -- Regarding the JavaDocs: - The idiomatic style is to have a short description and then a blank line for more details (the first line will be displayed as a summary in the IDE etc.) - The `of the current StreamTask` is clear from context - The Thread-safety part should be more explicit, for instance `The caller has to make sure to call this method in scope of the task's checkpoint lock`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-3257) Add Exactly-Once Processing Guarantees in Iterative DataStream Jobs
[ https://issues.apache.org/jira/browse/FLINK-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210270#comment-15210270 ] ASF GitHub Bot commented on FLINK-3257: --- Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/1668#discussion_r57320231 --- Diff: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java --- @@ -450,112 +450,121 @@ else if (operator != null) { } } - @Override - public boolean triggerCheckpoint(final long checkpointId, final long timestamp) throws Exception { - LOG.debug("Starting checkpoint {} on task {}", checkpointId, getName()); - - synchronized (lock) { - if (isRunning) { - - // since both state checkpointing and downstream barrier emission occurs in this - // lock scope, they are an atomic operation regardless of the order in which they occur - // we immediately emit the checkpoint barriers, so the downstream operators can start - // their checkpoint work as soon as possible - operatorChain.broadcastCheckpointBarrier(checkpointId, timestamp); - - // now draw the state snapshot - final StreamOperator[] allOperators = operatorChain.getAllOperators(); - final StreamTaskState[] states = new StreamTaskState[allOperators.length]; + /** +* Checkpoints all operator states of the current StreamTask. +* Thread-safety must be handled outside the scope of this function +*/ + protected boolean checkpointStatesInternal(final long checkpointId, long timestamp) throws Exception { --- End diff -- Regarding the JavaDocs: - The idiomatic style is to have a short description and then a blank line for more details (the first line will be displayed as a summary in the IDE etc.) - The `of the current StreamTask` is clear from context - The Thread-safety part should be more explicit, for instance `The caller has to make sure to call this method in scope of the task's checkpoint lock`.
> Add Exactly-Once Processing Guarantees in Iterative DataStream Jobs > --- > > Key: FLINK-3257 > URL: https://issues.apache.org/jira/browse/FLINK-3257 > Project: Flink > Issue Type: Improvement >Reporter: Paris Carbone >Assignee: Paris Carbone > > The current snapshotting algorithm cannot support cycles in the execution > graph. An alternative scheme can potentially include records in-transit > through the back-edges of a cyclic execution graph (ABS [1]) to achieve the > same guarantees. > One straightforward implementation of ABS for cyclic graphs can work as > follows along the lines: > 1) Upon triggering a barrier in an IterationHead from the TaskManager start > block output and start upstream backup of all records forwarded from the > respective IterationSink. > 2) The IterationSink should eventually forward the current snapshotting epoch > barrier to the IterationSource. > 3) Upon receiving a barrier from the IterationSink, the IterationSource > should finalize the snapshot, unblock its output and emit all records > in-transit in FIFO order and continue the usual execution. > -- > Upon restart the IterationSource should emit all records from the injected > snapshot first and then continue its usual execution. > Several optimisations and slight variations can be potentially achieved but > this can be the initial implementation take. > [1] http://arxiv.org/abs/1506.08603 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
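The three-step protocol in the issue description can be modeled in a few lines of plain Java: on a barrier from the TaskManager the iteration head starts logging records arriving over the back-edge, and when the barrier returns from the iteration tail it finalizes the snapshot and re-emits the logged records in FIFO order. Everything below (class and method names, the use of strings as records) is an illustrative simplification of the ABS idea, not Flink's implementation:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Simplified model of the iteration-head side of ABS for cyclic graphs.
class IterationHeadModel {
    private final Queue<String> backEdgeLog = new ArrayDeque<>();
    private boolean logging = false;
    private List<String> lastSnapshot = new ArrayList<>();

    // Step 1: barrier triggered by the TaskManager -> start upstream backup
    void onBarrierFromTaskManager() { logging = true; }

    void onBackEdgeRecord(String record) {
        if (logging) backEdgeLog.add(record); // log in-transit records
    }

    // Step 3: barrier returned from the iteration tail -> finalize the
    // snapshot and replay the logged records in FIFO order.
    List<String> onBarrierFromTail() {
        lastSnapshot = new ArrayList<>(backEdgeLog);
        List<String> replay = new ArrayList<>();
        while (!backEdgeLog.isEmpty()) replay.add(backEdgeLog.poll());
        logging = false;
        return replay;
    }

    List<String> snapshot() { return lastSnapshot; }
}
```

On recovery, the same FIFO replay would run first from the restored snapshot before normal execution resumes, matching the "upon restart" note in the description.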
[jira] [Commented] (FLINK-3257) Add Exactly-Once Processing Guarantees in Iterative DataStream Jobs
[ https://issues.apache.org/jira/browse/FLINK-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210263#comment-15210263 ] ASF GitHub Bot commented on FLINK-3257: --- Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/1668#discussion_r57319737 --- Diff: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java --- @@ -450,112 +450,121 @@ else if (operator != null) { } } - @Override - public boolean triggerCheckpoint(final long checkpointId, final long timestamp) throws Exception { - LOG.debug("Starting checkpoint {} on task {}", checkpointId, getName()); - - synchronized (lock) { - if (isRunning) { - - // since both state checkpointing and downstream barrier emission occurs in this - // lock scope, they are an atomic operation regardless of the order in which they occur - // we immediately emit the checkpoint barriers, so the downstream operators can start - // their checkpoint work as soon as possible - operatorChain.broadcastCheckpointBarrier(checkpointId, timestamp); - - // now draw the state snapshot - final StreamOperator[] allOperators = operatorChain.getAllOperators(); - final StreamTaskState[] states = new StreamTaskState[allOperators.length]; + /** +* Checkpoints all operator states of the current StreamTask. +* Thread-safety must be handled outside the scope of this function +*/ + protected boolean checkpointStatesInternal(final long checkpointId, long timestamp) throws Exception { --- End diff -- What about naming this as in the comments `drawStateSnapshot`? That it is internal is more or less communicated by the fact that it is a `protected` method. 
> Add Exactly-Once Processing Guarantees in Iterative DataStream Jobs > --- > > Key: FLINK-3257 > URL: https://issues.apache.org/jira/browse/FLINK-3257 > Project: Flink > Issue Type: Improvement >Reporter: Paris Carbone >Assignee: Paris Carbone > > The current snapshotting algorithm cannot support cycles in the execution > graph. An alternative scheme can potentially include records in-transit > through the back-edges of a cyclic execution graph (ABS [1]) to achieve the > same guarantees. > One straightforward implementation of ABS for cyclic graphs can work as > follows along the lines: > 1) Upon triggering a barrier in an IterationHead from the TaskManager start > block output and start upstream backup of all records forwarded from the > respective IterationSink. > 2) The IterationSink should eventually forward the current snapshotting epoch > barrier to the IterationSource. > 3) Upon receiving a barrier from the IterationSink, the IterationSource > should finalize the snapshot, unblock its output and emit all records > in-transit in FIFO order and continue the usual execution. > -- > Upon restart the IterationSource should emit all records from the injected > snapshot first and then continue its usual execution. > Several optimisations and slight variations can be potentially achieved but > this can be the initial implementation take. > [1] http://arxiv.org/abs/1506.08603 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3657) Change access of DataSetUtils.countElements() to 'public'
[ https://issues.apache.org/jira/browse/FLINK-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210261#comment-15210261 ] ASF GitHub Bot commented on FLINK-3657: --- Github user smarthi commented on the pull request: https://github.com/apache/flink/pull/1829#issuecomment-200848523 @uce @fhueske to add more context to this PR, we r in the final stretch of a planned 0.12.0 Mahout release that adds Flink as a distributed engine for Samsara linear algebra. I will be demoing the Mahout-Flink integration at my talk in ApacheCon, Vancouver, May 9-11. If this can't make it in 1.0.1, I guess we need to go with a redundant clone in Mahout until this becomes available in a future Flink release and want to avoid that. > Change access of DataSetUtils.countElements() to 'public' > -- > > Key: FLINK-3657 > URL: https://issues.apache.org/jira/browse/FLINK-3657 > Project: Flink > Issue Type: Improvement > Components: DataSet API >Affects Versions: 1.0.0 >Reporter: Suneel Marthi >Assignee: Suneel Marthi >Priority: Minor > Fix For: 1.0.1 > > > The access of DatasetUtils.countElements() is presently 'private', change > that to be 'public'. We happened to be replicating the functionality in our > project and realized the method already existed in Flink. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
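For context, a counting utility of the kind discussed in this issue boils down to one local count per partition plus a combine step. The sketch below is a plain-Java analogue under that assumption; it mirrors the idea only, not Flink's actual `DataSetUtils` code, which operates on a distributed `DataSet`:

```java
import java.util.List;

// Plain-Java analogue of per-partition element counting: each partition is
// counted locally, and the partial counts are summed.
class CountElementsSketch {
    static long countAcrossPartitions(List<List<Integer>> partitions) {
        long total = 0;
        for (List<Integer> partition : partitions) {
            total += partition.size(); // one local count per partition
        }
        return total;
    }
}
```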
[jira] [Commented] (FLINK-3633) Job submission silently fails when using user code types
[ https://issues.apache.org/jira/browse/FLINK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210201#comment-15210201 ] ASF GitHub Bot commented on FLINK-3633: --- Github user uce commented on the pull request: https://github.com/apache/flink/pull/1818#issuecomment-200830880 The other programs that are executed in `ClassLoaderITCase` are contained in the `org.apache.flink.test.classloading.jar` package (search for `KMeansForTest` for example). They are then packaged by the assembly plugin (see `pom.xml` of `flink-tests`). The `ClassLoaderITCase` sets up a cluster and submits the assembled JARs. Can we just follow this setup? > Job submission silently fails when using user code types > > > Key: FLINK-3633 > URL: https://issues.apache.org/jira/browse/FLINK-3633 > Project: Flink > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann >Priority: Blocker > Fix For: 1.1.0 > > > With the changes introduced by FLINK-3327, it is no longer possible to run > remote Flink jobs which work on user code types. The reason is that now the > {{ExecutionConfig}} is directly stored in the {{JobGraph}} which is sent as > an Akka message to the {{JobManager}}. By default, user code types are > automatically detected and registered in the {{ExecutionConfig}}. When > deserializing a {{JobGraph}} whose {{ExecutionConfig}} contains user code > classes, the user code class loader is consequently required. However, Akka > does not have access to it and uses the system class loader. This causes > Akka to silently discard the {{SubmitJob}} message, which cannot be deserialized > because of a {{ClassNotFoundException}}. > I propose to not send the {{ExecutionConfig}} explicitly with the > {{JobGraph}} and, thus, to partially revert the changes to before FLINK-3327. > Before, the {{ExecutionConfig}} was serialized into the job configuration and > deserialized on the {{TaskManager}} using the proper user code class loader. 
> In order to reproduce the problem you can submit the following job to a > remote cluster. > {code} > public class Job { > public static class CustomType { > private final int value; > public CustomType(int value) { > this.value = value; > } > @Override > public String toString() { > return "CustomType(" + value + ")"; > } > } > public static void main(String[] args) throws Exception { > ExecutionEnvironment env = > ExecutionEnvironment.createRemoteEnvironment(Address, Port, PathToJar); > env.getConfig().disableAutoTypeRegistration(); > DataSet<Integer> input = env.fromElements(1,2,3,4,5); > DataSet<CustomType> customTypes = input.map(new > MapFunction<Integer, CustomType>() { > @Override > public CustomType map(Integer integer) throws Exception > { > return new CustomType(integer); > } > }); > customTypes.print(); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
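The fix direction sketched in the description, serializing the {{ExecutionConfig}} eagerly and deserializing it later with the user code class loader instead of letting the RPC layer use the system class loader, rests on the standard `resolveClass` override pattern of Java serialization. A self-contained sketch of that pattern (class names are illustrative, not Flink's):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;

// Serialize a value to bytes up front; deserialize later with an explicitly
// chosen class loader, so user-code classes can be resolved even when the
// surrounding framework only knows the system class loader.
class ClassLoaderAwareSerialization {

    static byte[] serialize(Serializable value) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value);
        }
        return bos.toByteArray();
    }

    static Object deserialize(byte[] bytes, ClassLoader userCodeClassLoader)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes)) {
            @Override
            protected Class<?> resolveClass(ObjectStreamClass desc)
                    throws IOException, ClassNotFoundException {
                // resolve against the user code class loader, not the system one
                return Class.forName(desc.getName(), false, userCodeClassLoader);
            }
        }) {
            return ois.readObject();
        }
    }
}
```

With this in place, the bytes can travel through Akka as an opaque payload and only be turned back into objects where the right class loader is available.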
[jira] [Commented] (FLINK-3633) Job submission silently fails when using user code types
[ https://issues.apache.org/jira/browse/FLINK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210180#comment-15210180 ] ASF GitHub Bot commented on FLINK-3633: --- Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/1818#issuecomment-200821794 I agree that having a dedicated class loading test would be nicer. However, we would need to build a jar containing the test job before executing the test, right? So the easiest solution for that would be a new module. But that is also really ugly. @uce do you have a better idea for that? > Job submission silently fails when using user code types > > > Key: FLINK-3633 > URL: https://issues.apache.org/jira/browse/FLINK-3633 > Project: Flink > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann >Priority: Blocker > Fix For: 1.1.0 > > > With the changes introduced by FLINK-3327, it is no longer possible to run > remote Flink jobs which work on user code types. The reason is that now the > {{ExecutionConfig}} is directly stored in the {{JobGraph}} which is sent as > an Akka message to the {{JobManager}}. By default, user code types are > automatically detected and registered in the {{ExecutionConfig}}. When > deserializing a {{JobGraph}} whose {{ExecutionConfig}} contains user code > classes, the user code class loader is consequently required. However, Akka > does not have access to it and uses the system class loader. This causes > Akka to silently discard the {{SubmitJob}} message, which cannot be deserialized > because of a {{ClassNotFoundException}}. > I propose to not send the {{ExecutionConfig}} explicitly with the > {{JobGraph}} and, thus, to partially revert the changes to before FLINK-3327. > Before, the {{ExecutionConfig}} was serialized into the job configuration and > deserialized on the {{TaskManager}} using the proper user code class loader. 
> In order to reproduce the problem you can submit the following job to a > remote cluster. > {code} > public class Job { > public static class CustomType { > private final int value; > public CustomType(int value) { > this.value = value; > } > @Override > public String toString() { > return "CustomType(" + value + ")"; > } > } > public static void main(String[] args) throws Exception { > ExecutionEnvironment env = > ExecutionEnvironment.createRemoteEnvironment(Address, Port, PathToJar); > env.getConfig().disableAutoTypeRegistration(); > DataSet<Integer> input = env.fromElements(1,2,3,4,5); > DataSet<CustomType> customTypes = input.map(new > MapFunction<Integer, CustomType>() { > @Override > public CustomType map(Integer integer) throws Exception > { > return new CustomType(integer); > } > }); > customTypes.print(); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (FLINK-2935) Allow scala shell to read yarn properties
[ https://issues.apache.org/jira/browse/FLINK-2935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chiwan Park closed FLINK-2935. -- Resolution: Implemented Fix Version/s: 1.1.0 Implemented via 5108f6875f102e613d70cc90fd13d269465f2bff. > Allow scala shell to read yarn properties > - > > Key: FLINK-2935 > URL: https://issues.apache.org/jira/browse/FLINK-2935 > Project: Flink > Issue Type: Improvement > Components: Scala Shell >Affects Versions: 0.9.1 >Reporter: Johannes >Assignee: Chiwan Park >Priority: Minor > Labels: easyfix > Fix For: 1.1.0 > > > Currently the deployment of flink via yarn and the scala shell are not linked. > When deploying a yarn session the file > bq. org.apache.flink.client.CliFrontend > creates a > bq. .yarn-properties-$username > file with the connection properties. > There should be a way to have the scala shell automatically read this file if > wanted as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
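A `.yarn-properties-$username` file is a plain Java properties file, so reading it back from the Scala shell is mostly a parsing exercise. The sketch below shows the general shape; the key name `jobManagerAddress` is a made-up placeholder for illustration, since the exact keys `CliFrontend` writes are not shown here:

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

// Minimal sketch: parse a yarn-properties-style file and pick out the
// connection entry. The key name is an assumption, not Flink's actual key.
class YarnPropertiesReader {
    static String readJobManagerAddress(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);
        return props.getProperty("jobManagerAddress");
    }
}
```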
[jira] [Commented] (FLINK-2935) Allow scala shell to read yarn properties
[ https://issues.apache.org/jira/browse/FLINK-2935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210165#comment-15210165 ] ASF GitHub Bot commented on FLINK-2935: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/1500 > Allow scala shell to read yarn properties > - > > Key: FLINK-2935 > URL: https://issues.apache.org/jira/browse/FLINK-2935 > Project: Flink > Issue Type: Improvement > Components: Scala Shell >Affects Versions: 0.9.1 >Reporter: Johannes >Assignee: Chiwan Park >Priority: Minor > Labels: easyfix > > Currently the deployment of flink via yarn and the scala shell are not linked. > When deploying a yarn session the file > bq. org.apache.flink.client.CliFrontend > creates a > bq. .yarn-properties-$username > file with the connection properties. > There should be a way to have the scala shell automatically read this file if > wanted as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2935) Allow scala shell to read yarn properties
[ https://issues.apache.org/jira/browse/FLINK-2935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210134#comment-15210134 ] ASF GitHub Bot commented on FLINK-2935: --- Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/1500#issuecomment-200804302 Oh, thanks for pointing it @mxm. Then I'll merge this. > Allow scala shell to read yarn properties > - > > Key: FLINK-2935 > URL: https://issues.apache.org/jira/browse/FLINK-2935 > Project: Flink > Issue Type: Improvement > Components: Scala Shell >Affects Versions: 0.9.1 >Reporter: Johannes >Assignee: Chiwan Park >Priority: Minor > Labels: easyfix > > Currently the deployment of flink via yarn and the scala shell are not linked. > When deploying a yarn session the file > bq. org.apache.flink.client.CliFrontend > creates a > bq. .yarn-properties-$username > file with the connection properties. > There should be a way to have the scala shell automatically read this file if > wanted as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3657) Change access of DataSetUtils.countElements() to 'public'
[ https://issues.apache.org/jira/browse/FLINK-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210133#comment-15210133 ] ASF GitHub Bot commented on FLINK-3657: --- Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/1829#issuecomment-200804264 @uce, API stability is not an issue. The method was `private` before and not part of the accessible API. IMO, the question is rather whether we want to add new features in a bugfix release. > Change access of DataSetUtils.countElements() to 'public' > -- > > Key: FLINK-3657 > URL: https://issues.apache.org/jira/browse/FLINK-3657 > Project: Flink > Issue Type: Improvement > Components: DataSet API >Affects Versions: 1.0.0 >Reporter: Suneel Marthi >Assignee: Suneel Marthi >Priority: Minor > Fix For: 1.0.1 > > > The access of DatasetUtils.countElements() is presently 'private', change > that to be 'public'. We happened to be replicating the functionality in our > project and realized the method already existed in Flink. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3639) Add methods and utilities to register DataSets and Tables in the TableEnvironment
[ https://issues.apache.org/jira/browse/FLINK-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210129#comment-15210129 ] ASF GitHub Bot commented on FLINK-3639: --- Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/1827#issuecomment-200803230 Changes look good! +1 to merge > Add methods and utilities to register DataSets and Tables in the > TableEnvironment > - > > Key: FLINK-3639 > URL: https://issues.apache.org/jira/browse/FLINK-3639 > Project: Flink > Issue Type: New Feature > Components: Table API >Affects Versions: 1.1.0 >Reporter: Vasia Kalavri >Assignee: Vasia Kalavri > > In order to make tables queryable from SQL we need to register them under a > unique name in the TableEnvironment. > [This design > document|https://docs.google.com/document/d/1sITIShmJMGegzAjGqFuwiN_iw1urwykKsLiacokxSw0/edit] > describes the proposed API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3657) Change access of DataSetUtils.countElements() to 'public'
[ https://issues.apache.org/jira/browse/FLINK-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210127#comment-15210127 ] ASF GitHub Bot commented on FLINK-3657: --- Github user uce commented on the pull request: https://github.com/apache/flink/pull/1829#issuecomment-200803091 As a side note (I didn't look at the code changes): We can't rename the method name for `1.0.1` as `DataSetUtils` is annotated with `PublicEvolving`, meaning that we are only allowed to change between minor versions, that is for `1.1.0`. > Change access of DataSetUtils.countElements() to 'public' > -- > > Key: FLINK-3657 > URL: https://issues.apache.org/jira/browse/FLINK-3657 > Project: Flink > Issue Type: Improvement > Components: DataSet API >Affects Versions: 1.0.0 >Reporter: Suneel Marthi >Assignee: Suneel Marthi >Priority: Minor > Fix For: 1.0.1 > > > The access of DatasetUtils.countElements() is presently 'private', change > that to be 'public'. We happened to be replicating the functionality in our > project and realized the method already existed in Flink. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3544) ResourceManager runtime components
[ https://issues.apache.org/jira/browse/FLINK-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210126#comment-15210126 ] ASF GitHub Bot commented on FLINK-3544: --- Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/1741#issuecomment-200803051 Continuing my review now, but don't block this on me. If I find anything crucial, I will open a pull request against master if it is merged by then.. > ResourceManager runtime components > -- > > Key: FLINK-3544 > URL: https://issues.apache.org/jira/browse/FLINK-3544 > Project: Flink > Issue Type: Sub-task > Components: ResourceManager >Affects Versions: 1.1.0 >Reporter: Maximilian Michels >Assignee: Maximilian Michels > Fix For: 1.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3547) Add support for streaming projection, selection, and union
[ https://issues.apache.org/jira/browse/FLINK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210123#comment-15210123 ] ASF GitHub Bot commented on FLINK-3547: --- Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/1820#issuecomment-200802410 Thanks for the update! +1 to merge :-) > Add support for streaming projection, selection, and union > -- > > Key: FLINK-3547 > URL: https://issues.apache.org/jira/browse/FLINK-3547 > Project: Flink > Issue Type: Sub-task > Components: Table API >Reporter: Vasia Kalavri >Assignee: Vasia Kalavri > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3633) Job submission silently fails when using user code types
[ https://issues.apache.org/jira/browse/FLINK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210119#comment-15210119 ] ASF GitHub Bot commented on FLINK-3633: --- Github user uce commented on the pull request: https://github.com/apache/flink/pull/1818#issuecomment-200802077 The changes look good and make parts where user code is serialized very clear. :+1: I've verified that the `ScalaShellITCase` with the change in this PR fails for the current master. I'm wondering whether it is possible to make the test for this more explicit by adding a JAR with the example program outlined in the JIRA issue to the `ClassLoaderITCase`. In any case, +1 to merge. It's your call whether you add a further class loader test or not. > Job submission silently fails when using user code types > > > Key: FLINK-3633 > URL: https://issues.apache.org/jira/browse/FLINK-3633 > Project: Flink > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann >Priority: Blocker > Fix For: 1.1.0 > > > With the changes introduced by FLINK-3327, it is no longer possible to run > remote Flink jobs which work on user code types. The reason is that now the > {{ExecutionConfig}} is directly stored in the {{JobGraph}} which is sent as > an Akka message to the {{JobManager}}. By default, user code types are > automatically detected and registered in the {{ExecutionConfig}}. When > deserializing a {{JobGraph}} whose {{ExecutionConfig}} contains user code > classes, the user code class loader is consequently required. However, Akka > does not have access to it and uses the system class loader. This causes > Akka to silently discard the {{SubmitJob}} message, which cannot be deserialized > because of a {{ClassNotFoundException}}. > I propose to not send the {{ExecutionConfig}} explicitly with the > {{JobGraph}} and, thus, to partially revert the changes to before FLINK-3327. 
> Before, the {{ExecutionConfig}} was serialized into the job configuration and > deserialized on the {{TaskManager}} using the proper user code class loader. > In order to reproduce the problem you can submit the following job to a > remote cluster. > {code} > public class Job { > public static class CustomType { > private final int value; > public CustomType(int value) { > this.value = value; > } > @Override > public String toString() { > return "CustomType(" + value + ")"; > } > } > public static void main(String[] args) throws Exception { > ExecutionEnvironment env = > ExecutionEnvironment.createRemoteEnvironment(Address, Port, PathToJar); > env.getConfig().disableAutoTypeRegistration(); > DataSet<Integer> input = env.fromElements(1,2,3,4,5); > DataSet<CustomType> customTypes = input.map(new > MapFunction<Integer, CustomType>() { > @Override > public CustomType map(Integer integer) throws Exception > { > return new CustomType(integer); > } > }); > customTypes.print(); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-3633] Fix user code de/serialization in...
Github user uce commented on the pull request: https://github.com/apache/flink/pull/1818#issuecomment-200802077 The changes look good and make parts where user code is serialized very clear. :+1: I've verified that the `ScalaShellITCase` with the change in this PR fails for the current master. I'm wondering whether it is possible to make the test for this more explicit by adding a JAR with the example program outlined in the JIRA issue to the `ClassLoaderITCase`. In any case, +1 to merge. It's your call whether you add a further class loader test or not. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2946) Add orderBy() to Table API
[ https://issues.apache.org/jira/browse/FLINK-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210118#comment-15210118 ] Fabian Hueske commented on FLINK-2946: -- Hi [~dawidwys], thanks a lot for working on this issue! I had a look at your branch. You're definitely on the right track. Here are a few comments: - The Table API syntax looks good - In {{Table.orderBy()}} you should not extract aggregations, etc. Instead, check that the expressions match the following patterns ({{Table.as()}} does similar checks): -- {{UnresolvedFieldReference}} -- {{Asc(UnresolvedFieldReference)}} -- {{Desc(UnresolvedFieldReference)}} -- We can add support for more complex expressions and order by position later. - Add asc() to {{RexNodeTranslator}} - I just realized that Flink's range partitioning lacks support to define sort orders for partition keys. We need to add this to make global sorting work correctly. I added FLINK-3665 to address this issue. - We do not need to range partition if the parallelism of the input is 1 (check {{inputDs.getParallelism() == 1}}) I'll be on vacation for about two weeks. Not sure if I can follow up on this until I am back. > Add orderBy() to Table API > -- > > Key: FLINK-2946 > URL: https://issues.apache.org/jira/browse/FLINK-2946 > Project: Flink > Issue Type: New Feature > Components: Table API >Reporter: Timo Walther >Assignee: Dawid Wysakowicz > > In order to implement a FLINK-2099 prototype that uses the Table API's code > generation facilities, the Table API needs a sorting feature. > I will implement it in the next days. Ideas on how to implement such a sorting > feature are very welcome. Is there a more efficient way than > {{.sortPartition(...).setParallelism(1)}}? Is it better to sort locally on the > nodes first and finally sort on one node afterwards?
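The validation rule suggested above (accept only field references, optionally wrapped in a single Asc/Desc) can be sketched with a hypothetical mini-model of the expression classes. The names mirror the Table API classes mentioned in the comment, but this is not Flink code:

```java
public class OrderByValidation {

    // Hypothetical stand-ins for the Table API expression classes named above.
    interface Expression {}

    static class UnresolvedFieldReference implements Expression {
        final String name;
        UnresolvedFieldReference(String name) { this.name = name; }
    }

    static class Asc implements Expression {
        final Expression child;
        Asc(Expression child) { this.child = child; }
    }

    static class Desc implements Expression {
        final Expression child;
        Desc(Expression child) { this.child = child; }
    }

    // An orderBy() expression is valid only if it is a plain field reference,
    // or a field reference wrapped in exactly one Asc or Desc.
    static boolean isValidOrderByExpression(Expression e) {
        if (e instanceof UnresolvedFieldReference) {
            return true;
        }
        if (e instanceof Asc) {
            return ((Asc) e).child instanceof UnresolvedFieldReference;
        }
        if (e instanceof Desc) {
            return ((Desc) e).child instanceof UnresolvedFieldReference;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isValidOrderByExpression(new Asc(new UnresolvedFieldReference("a"))));
        System.out.println(isValidOrderByExpression(new Asc(new Asc(new UnresolvedFieldReference("a")))));
    }
}
```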
[jira] [Created] (FLINK-3665) Range partitioning lacks support to define sort orders
Fabian Hueske created FLINK-3665: Summary: Range partitioning lacks support to define sort orders Key: FLINK-3665 URL: https://issues.apache.org/jira/browse/FLINK-3665 Project: Flink Issue Type: Improvement Components: DataSet API Affects Versions: 1.0.0 Reporter: Fabian Hueske Fix For: 1.1.0 {{DataSet.partitionByRange()}} does not allow specifying the sort order of fields. This is fine if range partitioning is used to reduce data skew. However, it is not sufficient if range partitioning is used to sort a data set in parallel. Since {{DataSet.partitionByRange()}} is {{@Public}} API and cannot be easily changed, I propose to add a method {{withOrders(Order... orders)}} to {{PartitionOperator}}. The method should throw an exception if the partitioning method of {{PartitionOperator}} is not range partitioning.
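A hypothetical sketch of the proposed {{withOrders(Order... orders)}} contract, including the exception for non-range partitioning. The classes here are minimal stand-ins, not Flink's actual {{PartitionOperator}}:

```java
public class PartitionOperatorSketch {

    enum Order { ASCENDING, DESCENDING }
    enum PartitionMethod { HASH, RANGE }

    // Stand-in for Flink's PartitionOperator, reduced to the proposed behavior.
    static class PartitionOperator {
        final PartitionMethod method;
        Order[] orders;

        PartitionOperator(PartitionMethod method) { this.method = method; }

        // The proposed method: only legal when the operator was created
        // by range partitioning; otherwise throw, as the issue suggests.
        PartitionOperator withOrders(Order... orders) {
            if (method != PartitionMethod.RANGE) {
                throw new IllegalStateException(
                        "Sort orders can only be defined for range partitioning");
            }
            this.orders = orders;
            return this;
        }
    }

    public static void main(String[] args) {
        PartitionOperator op = new PartitionOperator(PartitionMethod.RANGE)
                .withOrders(Order.ASCENDING, Order.DESCENDING);
        System.out.println(op.orders.length);
    }
}
```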
[jira] [Updated] (FLINK-2821) Change Akka configuration to allow accessing actors from different URLs
[ https://issues.apache.org/jira/browse/FLINK-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels updated FLINK-2821: -- Fix Version/s: 1.1.0 > Change Akka configuration to allow accessing actors from different URLs > --- > > Key: FLINK-2821 > URL: https://issues.apache.org/jira/browse/FLINK-2821 > Project: Flink > Issue Type: Bug > Components: Distributed Runtime >Reporter: Robert Metzger >Assignee: Maximilian Michels > Fix For: 1.1.0 > > > Akka expects the actor's URL to match exactly. > Cases where users were complaining about this, as pointed out here: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Error-trying-to-access-JM-through-proxy-td3018.html > - Proxy routing (as described here, send to the proxy URL, receiver > recognizes only the original URL) > - Using hostname / IP interchangeably does not work (we solved this by > always putting IP addresses into URLs, never hostnames) > - Binding to multiple interfaces (any local 0.0.0.0) does not work. Still > no solution to that (but it seems not too much of a restriction) > I am aware that this is not possible due to Akka, so it is actually not a > Flink bug. But I think we should track the resolution of the issue here > anyway because it's affecting our users' satisfaction.
[jira] [Commented] (FLINK-3651) Fix faulty RollingSink Restore
[ https://issues.apache.org/jira/browse/FLINK-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210100#comment-15210100 ] ASF GitHub Bot commented on FLINK-3651: --- Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/1830#issuecomment-200796710 I think the test fails only sporadically, depending on the speed at which tasks are deleting the files. If you are lucky, everything is correct. > Fix faulty RollingSink Restore > -- > > Key: FLINK-3651 > URL: https://issues.apache.org/jira/browse/FLINK-3651 > Project: Flink > Issue Type: Bug > Components: Streaming >Reporter: Aljoscha Krettek >Assignee: Aljoscha Krettek > > The RollingSink restore logic has a bug where the sink for subtask index 1 > also removes files for subtask index 11 because the regex that checks for the > file name also matches that one. Adding the suffix to the regex should solve > the problem because then the regex for 1 will only match files for subtask > index 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-3651] Fix faulty RollingSink Restore
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/1830#issuecomment-200796710 I think the test fails only sporadically, depending on the speed at which tasks are deleting the files. If you are lucky, everything is correct.
[jira] [Commented] (FLINK-3651) Fix faulty RollingSink Restore
[ https://issues.apache.org/jira/browse/FLINK-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210096#comment-15210096 ] ASF GitHub Bot commented on FLINK-3651: --- Github user uce commented on the pull request: https://github.com/apache/flink/pull/1830#issuecomment-200795802 Good catch! The changes look good to me. I tried running the adjusted `RollingSinkFaultToleranceITCase` and `RollingSinkFaultTolerance2ITCase` w/o the fix in `RollingSink` and they still succeeded. Is this expected? Is there a way we can test this differently? > Fix faulty RollingSink Restore > -- > > Key: FLINK-3651 > URL: https://issues.apache.org/jira/browse/FLINK-3651 > Project: Flink > Issue Type: Bug > Components: Streaming >Reporter: Aljoscha Krettek >Assignee: Aljoscha Krettek > > The RollingSink restore logic has a bug where the sink for subtask index 1 > also removes files for subtask index 11 because the regex that checks for the > file name also matches that one. Adding the suffix to the regex should solve > the problem because then the regex for 1 will only match files for subtask > index 1.
[GitHub] flink pull request: [FLINK-3651] Fix faulty RollingSink Restore
Github user uce commented on the pull request: https://github.com/apache/flink/pull/1830#issuecomment-200795802 Good catch! The changes look good to me. I tried running the adjusted `RollingSinkFaultToleranceITCase` and `RollingSinkFaultTolerance2ITCase` w/o the fix in `RollingSink` and they still succeeded. Is this expected? Is there a way we can test this differently?
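The 1-vs-11 prefix ambiguity described in the issue is easy to reproduce with plain regexes. A self-contained sketch (the "part-&lt;subtask&gt;-&lt;bucket&gt;" file naming used here is illustrative, not RollingSink's exact scheme):

```java
import java.util.regex.Pattern;

public class SubtaskFilePattern {

    // Buggy variant: matching only on the subtask-index prefix means
    // subtask 1's pattern also claims subtask 11's files.
    static Pattern prefixOnly(int subtask) {
        return Pattern.compile("part-" + subtask + ".*");
    }

    // Fixed variant: requiring the delimiter right after the index
    // disambiguates "part-1-..." from "part-11-...".
    static Pattern withDelimiter(int subtask) {
        return Pattern.compile("part-" + subtask + "-.*");
    }

    public static void main(String[] args) {
        System.out.println(prefixOnly(1).matcher("part-11-0").matches());    // the bug: matches
        System.out.println(withDelimiter(1).matcher("part-11-0").matches()); // the fix: no match
        System.out.println(withDelimiter(1).matcher("part-1-0").matches());  // still matches own files
    }
}
```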
[jira] [Commented] (FLINK-3524) Provide a JSONDeserialisationSchema in the kafka connector package
[ https://issues.apache.org/jira/browse/FLINK-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210093#comment-15210093 ] ASF GitHub Bot commented on FLINK-3524: --- Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/1834#issuecomment-200795303 +1 to merge > Provide a JSONDeserialisationSchema in the kafka connector package > -- > > Key: FLINK-3524 > URL: https://issues.apache.org/jira/browse/FLINK-3524 > Project: Flink > Issue Type: Improvement > Components: Kafka Connector >Reporter: Robert Metzger >Assignee: Chesnay Schepler > Labels: starter > > (I don't want to include this into 1.0.0) > Currently, there is no standardized way of parsing JSON data from a Kafka > stream. I see a lot of users using JSON in their topics. It would make things > easier for our users to provide a deserializer for them. > I suggest to use the jackson library because we already have it as a > dependency in Flink and it allows parsing from a byte[]. > I would suggest to provide the following classes: > - JSONDeserializationSchema() > - JSONDeKeyValueSerializationSchema(bool includeMetadata) > The second variant should produce a record like this: > {code} > {"key": "keydata", > "value": "valuedata", > "metadata": {"offset": 123, "topic": "", "partition": 2 } } > {code}
[GitHub] flink pull request: [FLINK-3524] [kafka] Add JSONDeserializationSc...
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/1834#issuecomment-200795303 +1 to merge
[jira] [Commented] (FLINK-3658) Allow the FlinkKafkaProducer to send data to multiple topics
[ https://issues.apache.org/jira/browse/FLINK-3658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210088#comment-15210088 ] ASF GitHub Bot commented on FLINK-3658: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/1832 > Allow the FlinkKafkaProducer to send data to multiple topics > > > Key: FLINK-3658 > URL: https://issues.apache.org/jira/browse/FLINK-3658 > Project: Flink > Issue Type: Improvement > Components: Kafka Connector >Affects Versions: 1.0.0 >Reporter: Robert Metzger >Assignee: Robert Metzger > Fix For: 1.1.0 > > > Currently, the FlinkKafkaProducer is sending all events to one topic defined > when creating the producer. > We could allow users to send messages to multiple topics by extending the > {{KeyedSerializationSchema}} by a method {{public String getTargetTopic(T > element)}} which overrides the default topic if the return value is not null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-3658) Allow the FlinkKafkaProducer to send data to multiple topics
[ https://issues.apache.org/jira/browse/FLINK-3658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger resolved FLINK-3658. --- Resolution: Fixed Fix Version/s: 1.1.0 Resolved in master for 1.1.0 http://git-wip-us.apache.org/repos/asf/flink/commit/c77e5ece > Allow the FlinkKafkaProducer to send data to multiple topics > > > Key: FLINK-3658 > URL: https://issues.apache.org/jira/browse/FLINK-3658 > Project: Flink > Issue Type: Improvement > Components: Kafka Connector >Affects Versions: 1.0.0 >Reporter: Robert Metzger >Assignee: Robert Metzger > Fix For: 1.1.0 > > > Currently, the FlinkKafkaProducer is sending all events to one topic defined > when creating the producer. > We could allow users to send messages to multiple topics by extending the > {{KeyedSerializationSchema}} by a method {{public String getTargetTopic(T > element)}} which overrides the default topic if the return value is not null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-3658][Kafka] Allow producing into multi...
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/1832
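The merged extension lets {{getTargetTopic(T element)}} override the producer's default topic whenever it returns non-null. A hypothetical mini-version of that resolution logic (not the real Flink interface):

```java
public class TargetTopicDemo {

    // Stand-in for the extended KeyedSerializationSchema contract:
    // returning null keeps the producer's default topic.
    interface KeyedSerializationSchema<T> {
        String getTargetTopic(T element);
    }

    // Per-record topic resolution, as described in the issue.
    static <T> String resolveTopic(KeyedSerializationSchema<T> schema, T element, String defaultTopic) {
        String target = schema.getTargetTopic(element);
        return target != null ? target : defaultTopic;
    }

    public static void main(String[] args) {
        // Example routing rule: error records go to a dedicated topic.
        KeyedSerializationSchema<String> routing =
                e -> e.startsWith("err") ? "errors" : null;
        System.out.println(resolveTopic(routing, "err: boom", "events")); // errors
        System.out.println(resolveTopic(routing, "ok", "events"));        // events
    }
}
```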
[GitHub] flink pull request: [FLINK-2609] [streaming] auto-register types
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/1833#issuecomment-200793560 +1 to merge
[jira] [Commented] (FLINK-2609) Automatic type registration is only called from the batch execution environment
[ https://issues.apache.org/jira/browse/FLINK-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210086#comment-15210086 ] ASF GitHub Bot commented on FLINK-2609: --- Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/1833#issuecomment-200793560 +1 to merge > Automatic type registration is only called from the batch execution > environment > --- > > Key: FLINK-2609 > URL: https://issues.apache.org/jira/browse/FLINK-2609 > Project: Flink > Issue Type: Bug > Components: Streaming >Affects Versions: 0.10.0 >Reporter: Robert Metzger >Assignee: Chesnay Schepler > > Kryo types in the streaming API are quite expensive to serialize because they > are not automatically registered at Kryo. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210081#comment-15210081 ] Robert Metzger commented on FLINK-3655: --- Thank you for opening a JIRA for this feature request. I think it's a good idea and it shouldn't be too difficult to implement. > Allow comma-separated or multiple directories to be specified for > FileInputFormat > - > > Key: FLINK-3655 > URL: https://issues.apache.org/jira/browse/FLINK-3655 > Project: Flink > Issue Type: Improvement > Components: Core >Affects Versions: 1.0.0 >Reporter: Gna Phetsarath >Priority: Minor > Labels: starter > > Allow comma-separated or multiple directories to be specified for > FileInputFormat so that a DataSource will process the directories > sequentially. > > env.readFile("/data/2016/01/01/*/*,/data/2016/01/02/*/*,/data/2016/01/03/*/*") > in Scala >env.readFile(paths: Seq[String]) > or > env.readFile(path: String, otherPaths: String*) > Wildcard support would be a bonus.
[jira] [Updated] (FLINK-3655) Allow comma-separated or multiple directories to be specified for FileInputFormat
[ https://issues.apache.org/jira/browse/FLINK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger updated FLINK-3655: -- Labels: starter (was: ) > Allow comma-separated or multiple directories to be specified for > FileInputFormat > - > > Key: FLINK-3655 > URL: https://issues.apache.org/jira/browse/FLINK-3655 > Project: Flink > Issue Type: Improvement > Components: Core >Affects Versions: 1.0.0 >Reporter: Gna Phetsarath >Priority: Minor > Labels: starter > > Allow comma-separated or multiple directories to be specified for > FileInputFormat so that a DataSource will process the directories > sequentially. > > env.readFile("/data/2016/01/01/*/*,/data/2016/01/02/*/*,/data/2016/01/03/*/*") > in Scala >env.readFile(paths: Seq[String]) > or > env.readFile(path: String, otherPaths: String*) > Wildcard support would be a bonus. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
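A minimal sketch of how the comma-separated path argument from the request could be split before handing each path to a FileInputFormat. The helper is hypothetical, not part of the proposal:

```java
import java.util.Arrays;
import java.util.List;

public class MultiPathSplit {

    // Split a comma-separated path string into individual paths,
    // to be processed sequentially by the data source.
    static List<String> splitPaths(String commaSeparated) {
        return Arrays.asList(commaSeparated.split(","));
    }

    public static void main(String[] args) {
        List<String> paths =
                splitPaths("/data/2016/01/01/*/*,/data/2016/01/02/*/*,/data/2016/01/03/*/*");
        for (String p : paths) {
            System.out.println(p);
        }
    }
}
```

Note that splitting naively on "," assumes paths themselves contain no commas; the Seq/varargs overloads mentioned in the issue sidestep that ambiguity.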
[jira] [Closed] (FLINK-3663) FlinkKafkaConsumerBase.logPartitionInfo is missing a log marker
[ https://issues.apache.org/jira/browse/FLINK-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi closed FLINK-3663. -- Resolution: Fixed Fix Version/s: 1.0.1 1.1.0 Fixed in f0b6ba4 (master), 085f7b2 (release-1.0). > FlinkKafkaConsumerBase.logPartitionInfo is missing a log marker > --- > > Key: FLINK-3663 > URL: https://issues.apache.org/jira/browse/FLINK-3663 > Project: Flink > Issue Type: Bug > Components: Kafka Connector >Affects Versions: 1.0.0 >Reporter: Niels Zeilemaker >Priority: Trivial > Fix For: 1.1.0, 1.0.1 > > > While debugging a Flink Kafka app I noticed that the logPartitionInfo method > is broken. It's missing a marker, and hence the StringBuffer is never logged. > I can create a pull request fixing the problem if necessary.
[jira] [Commented] (FLINK-3663) FlinkKafkaConsumerBase.logPartitionInfo is missing a log marker
[ https://issues.apache.org/jira/browse/FLINK-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210074#comment-15210074 ] ASF GitHub Bot commented on FLINK-3663: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/1835 > FlinkKafkaConsumerBase.logPartitionInfo is missing a log marker > --- > > Key: FLINK-3663 > URL: https://issues.apache.org/jira/browse/FLINK-3663 > Project: Flink > Issue Type: Bug > Components: Kafka Connector >Affects Versions: 1.0.0 >Reporter: Niels Zeilemaker >Priority: Trivial > > While debugging a flink kafka app I noticed that the logPartitionInfo method > is broken. It's missing a marker, and hence the stringbuffer is never logged. > I can create a pull-request fixing the problem if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: FLINK-3663: FlinkKafkaConsumerBase.logPartitio...
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/1835
[jira] [Commented] (FLINK-3663) FlinkKafkaConsumerBase.logPartitionInfo is missing a log marker
[ https://issues.apache.org/jira/browse/FLINK-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210072#comment-15210072 ] ASF GitHub Bot commented on FLINK-3663: --- Github user uce commented on the pull request: https://github.com/apache/flink/pull/1835#issuecomment-200787034 I'm merging this > FlinkKafkaConsumerBase.logPartitionInfo is missing a log marker > --- > > Key: FLINK-3663 > URL: https://issues.apache.org/jira/browse/FLINK-3663 > Project: Flink > Issue Type: Bug > Components: Kafka Connector >Affects Versions: 1.0.0 >Reporter: Niels Zeilemaker >Priority: Trivial > > While debugging a flink kafka app I noticed that the logPartitionInfo method > is broken. It's missing a marker, and hence the stringbuffer is never logged. > I can create a pull-request fixing the problem if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
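The symptom (a missing log marker causing the argument to be silently dropped) can be illustrated with a tiny formatter that mimics SLF4J-style "{}" placeholder substitution. This is an illustration only, not SLF4J's implementation:

```java
public class LogMarkerDemo {

    // Mimic "{}" placeholder substitution: if the message template lacks the
    // marker, the argument never appears in the log line -- the bug described
    // for logPartitionInfo, where the built partition list was never logged.
    static String format(String template, Object arg) {
        int i = template.indexOf("{}");
        if (i < 0) {
            return template; // no marker: the argument is silently dropped
        }
        return template.substring(0, i) + arg + template.substring(i + 2);
    }

    public static void main(String[] args) {
        String partitions = "topic-0, topic-1";
        System.out.println(format("Consumer will read partitions: {}", partitions));
        System.out.println(format("Consumer will read partitions:", partitions));
    }
}
```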
[jira] [Closed] (FLINK-3653) recovery.zookeeper.storageDir is not documented on the configuration page
[ https://issues.apache.org/jira/browse/FLINK-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi closed FLINK-3653. -- Resolution: Fixed Fix Version/s: 1.0.1 Fixed in 5357ebe (release-1.0), 27dfd86 (master). > recovery.zookeeper.storageDir is not documented on the configuration page > - > > Key: FLINK-3653 > URL: https://issues.apache.org/jira/browse/FLINK-3653 > Project: Flink > Issue Type: Bug > Components: Documentation >Affects Versions: 1.0.0 >Reporter: Stefano Baghino >Assignee: Stefano Baghino >Priority: Minor > Fix For: 1.1.0, 1.0.1 > > > The {{recovery.zookeeper.storageDir}} option is documented in the HA page but > is missing from the configuration page. Since it's required for HA I think it > would be a good idea to have it on both pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3653) recovery.zookeeper.storageDir is not documented on the configuration page
[ https://issues.apache.org/jira/browse/FLINK-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210070#comment-15210070 ] ASF GitHub Bot commented on FLINK-3653: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/1828 > recovery.zookeeper.storageDir is not documented on the configuration page > - > > Key: FLINK-3653 > URL: https://issues.apache.org/jira/browse/FLINK-3653 > Project: Flink > Issue Type: Bug > Components: Documentation >Affects Versions: 1.0.0 >Reporter: Stefano Baghino >Assignee: Stefano Baghino >Priority: Minor > Fix For: 1.1.0, 1.0.1 > > > The {{recovery.zookeeper.storageDir}} option is documented in the HA page but > is missing from the configuration page. Since it's required for HA I think it > would be a good idea to have it on both pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-3653] recovery.zookeeper.storageDir is ...
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/1828
[jira] [Commented] (FLINK-3653) recovery.zookeeper.storageDir is not documented on the configuration page
[ https://issues.apache.org/jira/browse/FLINK-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210068#comment-15210068 ] ASF GitHub Bot commented on FLINK-3653: --- Github user uce commented on the pull request: https://github.com/apache/flink/pull/1828#issuecomment-200786158 Thanks! I'm merging this. > recovery.zookeeper.storageDir is not documented on the configuration page > - > > Key: FLINK-3653 > URL: https://issues.apache.org/jira/browse/FLINK-3653 > Project: Flink > Issue Type: Bug > Components: Documentation >Affects Versions: 1.0.0 >Reporter: Stefano Baghino >Assignee: Stefano Baghino >Priority: Minor > Fix For: 1.1.0 > > > The {{recovery.zookeeper.storageDir}} option is documented in the HA page but > is missing from the configuration page. Since it's required for HA I think it > would be a good idea to have it on both pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-3653] recovery.zookeeper.storageDir is ...
Github user uce commented on the pull request: https://github.com/apache/flink/pull/1828#issuecomment-200786158 Thanks! I'm merging this.
[GitHub] flink pull request: [FLINK-3547] add support for streaming filter,...
Github user vasia commented on the pull request: https://github.com/apache/flink/pull/1820#issuecomment-200784086 Thanks. I've addressed your comments!
[jira] [Commented] (FLINK-3547) Add support for streaming projection, selection, and union
[ https://issues.apache.org/jira/browse/FLINK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210065#comment-15210065 ] ASF GitHub Bot commented on FLINK-3547: --- Github user vasia commented on the pull request: https://github.com/apache/flink/pull/1820#issuecomment-200784086 Thanks. I've addressed your comments! > Add support for streaming projection, selection, and union > -- > > Key: FLINK-3547 > URL: https://issues.apache.org/jira/browse/FLINK-3547 > Project: Flink > Issue Type: Sub-task > Components: Table API >Reporter: Vasia Kalavri >Assignee: Vasia Kalavri >
[jira] [Commented] (FLINK-2935) Allow scala shell to read yarn properties
[ https://issues.apache.org/jira/browse/FLINK-2935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210043#comment-15210043 ] ASF GitHub Bot commented on FLINK-2935: --- Github user mxm commented on the pull request: https://github.com/apache/flink/pull/1500#issuecomment-200770572 Thanks @chiwanpark! As far as I can see, there are no overlapping classes between the two pull requests. > Allow scala shell to read yarn properties > - > > Key: FLINK-2935 > URL: https://issues.apache.org/jira/browse/FLINK-2935 > Project: Flink > Issue Type: Improvement > Components: Scala Shell >Affects Versions: 0.9.1 >Reporter: Johannes >Assignee: Chiwan Park >Priority: Minor > Labels: easyfix > > Currently the deployment of flink via yarn and the scala shell are not linked. > When deploying a yarn session the file > bq. org.apache.flink.client.CliFrontend > creates a > bq. .yarn-properties-$username > file with the connection properties. > There should be a way to have the scala shell automatically read this file if > wanted as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2935] [scala-shell] Allow Scala shell t...
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/1500#issuecomment-200770572 Thanks @chiwanpark! As far as I can see, there are no overlapping classes between the two pull requests.
[GitHub] flink pull request: [FLINK-3544] Introduce ResourceManager compone...
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/1741#issuecomment-200769986 I've incorporated the changes and the tests pass. I would like to merge the pull request. Please let me know if there are still pending code reviews.
[jira] [Commented] (FLINK-3639) Add methods and utilities to register DataSets and Tables in the TableEnvironment
[ https://issues.apache.org/jira/browse/FLINK-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210039#comment-15210039 ] ASF GitHub Bot commented on FLINK-3639: --- Github user vasia commented on the pull request: https://github.com/apache/flink/pull/1827#issuecomment-200769959 Thanks. I've addressed your comments! > Add methods and utilities to register DataSets and Tables in the > TableEnvironment > - > > Key: FLINK-3639 > URL: https://issues.apache.org/jira/browse/FLINK-3639 > Project: Flink > Issue Type: New Feature > Components: Table API >Affects Versions: 1.1.0 >Reporter: Vasia Kalavri >Assignee: Vasia Kalavri > > In order to make tables queryable from SQL we need to register them under a > unique name in the TableEnvironment. > [This design > document|https://docs.google.com/document/d/1sITIShmJMGegzAjGqFuwiN_iw1urwykKsLiacokxSw0/edit] > describes the proposed API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3544) ResourceManager runtime components
[ https://issues.apache.org/jira/browse/FLINK-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210037#comment-15210037 ] ASF GitHub Bot commented on FLINK-3544: --- Github user mxm commented on a diff in the pull request: https://github.com/apache/flink/pull/1741#discussion_r57295865
--- Diff: flink-core/src/main/java/org/apache/flink/configuration/ConfigConstants.java ---
@@ -240,6 +246,37 @@
 	 */
 	public static final String FS_STREAM_OPENING_TIMEOUT_KEY = "taskmanager.runtime.fs_timeout";
+
+	// Common Resource Framework Configuration (YARN & Mesos)
+
+	/**
+	 * Percentage of heap space to remove from containers (YARN / Mesos), to compensate
+	 * for other JVM memory usage.
+	 */
+	public static final String CONTAINERED_HEAP_CUTOFF_RATIO = "containered.heap-cutoff-ratio";
+
+	/**
+	 * Minimum amount of heap memory to remove in containers, as a safety margin.
+	 */
+	public static final String CONTAINERED_HEAP_CUTOFF_MIN = "containered.heap-cutoff-min";
+
+	/**
+	 * Prefix for passing custom environment variables to Flink's master process.
+	 * For example for passing LD_LIBRARY_PATH as an env variable to the AppMaster, set:
+	 * yarn.application-master.env.LD_LIBRARY_PATH: "/usr/lib/native"
+	 * in the flink-conf.yaml.
+	 */
+	public static final String CONTAINERED_MASTER_ENV_PREFIX = "containered.application-master.env.";
+
+	/**
+	 * Similar to the {@see CONTAINERED_MASTER_ENV_PREFIX}, this configuration prefix allows
+	 * setting custom environment variables for the workers (TaskManagers)
+	 */
+	public static final String CONTAINERED_TASK_MANAGER_ENV_PREFIX = "containered.taskmanager.env.";
--- End diff --
I thought about these prefixes again. I think they make sense but we could possibly change them before the release in a follow-up. 
> ResourceManager runtime components > -- > > Key: FLINK-3544 > URL: https://issues.apache.org/jira/browse/FLINK-3544 > Project: Flink > Issue Type: Sub-task > Components: ResourceManager >Affects Versions: 1.1.0 >Reporter: Maximilian Michels >Assignee: Maximilian Michels > Fix For: 1.1.0 > >