[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-122117624 Hi @pp86, are you still working on this PR? I would suggest opening a new PR for the documentation (FLINK-2362). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-1963) Improve distinct() transformation
[ https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630487#comment-14630487 ] ASF GitHub Bot commented on FLINK-1963: --- Github user hsaputra commented on a diff in the pull request: https://github.com/apache/flink/pull/905#discussion_r34847588 --- Diff: flink-tests/src/test/java/org/apache/flink/test/javaApiOperators/DistinctITCase.java --- @@ -274,4 +277,45 @@ public Integer map(POJO value) throws Exception { return (int) value.nestedPojo.longNumber; } } + + @Test + public void testCorrectnessOfDistinctOnAtomic() throws Exception { + /* + * check correctness of distinct on Integers + */ + + final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); + DataSet<Integer> ds = CollectionDataSets.getIntegerDataSet(env); + DataSet<Integer> reduceDs = ds.distinct(); + + List<Integer> result = reduceDs.collect(); + + String expected = "1\n2\n3\n4\n5"; + + compareResultAsText(result, expected); + } + + @Test + public void testCorrectnessOfDistinctOnAtomicWithSelectAllChar() throws Exception { + /* --- End diff -- Looks like the comment block is misaligned. Improve distinct() transformation - Key: FLINK-1963 URL: https://issues.apache.org/jira/browse/FLINK-1963 Project: Flink Issue Type: Improvement Components: Java API, Scala API Affects Versions: 0.9 Reporter: Fabian Hueske Assignee: pietro pinoli Priority: Minor Labels: starter Fix For: 0.9 The `distinct()` transformation is a bit limited right now with respect to processing atomic key types: - `distinct(String ...)` works only for composite data types (POJO, tuple), but wildcard expressions should also be supported for atomic key types - `distinct()` only works for composite types, but should also work for atomic key types - `distinct(KeySelector)` is the most generic one, but not very handy to use - `distinct(int ...)` works only for Tuple data types (which is fine) Fixing this should be rather easy.
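The semantics the ticket asks for can be illustrated outside of Flink. Below is a plain-Java sketch (using the JDK streams API, not Flink's `DataSet` API) of what `distinct()` on an atomic key type and `distinct(String ...)` on a composite type are expected to do; the `Word` POJO and the helper names are made up for illustration.

```java
import java.util.*;
import java.util.stream.*;

public class DistinctSketch {
    // Hypothetical POJO standing in for a composite type.
    static class Word {
        final String text;
        final int count;
        Word(String text, int count) { this.text = text; this.count = count; }
    }

    // distinct() on an atomic key type: the whole value is the key.
    static List<Integer> distinctInts(List<Integer> values) {
        return values.stream().distinct().collect(Collectors.toList());
    }

    // distinct("text") on a composite type: deduplicate by one field,
    // keeping the first element seen for each key.
    static List<Word> distinctByText(List<Word> words) {
        Map<String, Word> byKey = new LinkedHashMap<>();
        for (Word w : words) {
            byKey.putIfAbsent(w.text, w);
        }
        return new ArrayList<>(byKey.values());
    }

    public static void main(String[] args) {
        System.out.println(distinctInts(Arrays.asList(1, 2, 2, 3, 3, 3))); // [1, 2, 3]
        List<Word> words = Arrays.asList(new Word("a", 1), new Word("a", 2), new Word("b", 1));
        System.out.println(distinctByText(words).size()); // 2
    }
}
```

The `KeySelector` variant mentioned in the ticket corresponds to the general case where the key is computed by an arbitrary function rather than a field name.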
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1963) Improve distinct() transformation
[ https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630488#comment-14630488 ] ASF GitHub Bot commented on FLINK-1963: --- Github user hsaputra commented on a diff in the pull request: https://github.com/apache/flink/pull/905#discussion_r34847597 --- Diff: flink-tests/src/test/scala/org/apache/flink/api/scala/operators/DistinctITCase.scala --- @@ -174,6 +174,45 @@ class DistinctITCase(mode: TestExecutionMode) extends MultipleProgramsTestBase(m env.execute() expected = "1\n2\n3\n" } + + @Test + def testCorrectnessOfDistinctOnAtomic(): Unit = { +/* --- End diff -- Looks like the comment block is misaligned.
[jira] [Closed] (FLINK-1963) Improve distinct() transformation
[ https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabian Hueske closed FLINK-1963. Resolution: Fixed Fix Version/s: (was: 0.9) 0.10 fixed with 08ca9ffa9a95610c073145a09e731311e728c4fd
[jira] [Updated] (FLINK-1748) Integrate PageRank implementation into machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann updated FLINK-1748: - Priority: Minor (was: Major) Integrate PageRank implementation into machine learning library --- Key: FLINK-1748 URL: https://issues.apache.org/jira/browse/FLINK-1748 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Till Rohrmann Priority: Minor Labels: ML, Starter We already have an excellent approximative PageRank [1] implementation which has been contributed by [~StephanEwen]. Making this implementation part of the machine learning library would be a great addition. Resources: [1] [http://en.wikipedia.org/wiki/PageRank]
[jira] [Commented] (FLINK-2332) Assign session IDs to JobManager and TaskManager messages
[ https://issues.apache.org/jira/browse/FLINK-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629436#comment-14629436 ] ASF GitHub Bot commented on FLINK-2332: --- GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/917 [FLINK-2332] [runtime] Adds leader session IDs and registration session IDs ## Registration Session IDs Introduces registration session IDs for all registration messages. Upon receiving a registration message, this ID is checked and, if not correct, the message is discarded. That way, it is possible to distinguish old registration messages which are delayed from valid ones. In the current implementation, a static registration session ID is assigned when the `TaskManager` actor is created. However, with support for high availability, where the leader can change while a component is still trying to register at the old leader, it becomes important to distinguish old from new registration messages. ## Leader Session IDs In order to support high availability, we not only have to distinguish the old from the new registration messages, but also the control messages which are sent to and from the `JobManager` and the `TaskManager`. In order to filter out possibly old messages, this PR introduces a leader session ID which denotes the currently valid messages. However, unlike the registration session ID, leader session IDs are assigned to messages transparently. Messages which extend the `RequiresLeaderSessionID` interface will be wrapped in a `LeaderSessionMessage` which also contains the currently known leader session ID. At the receiving end, the `LeaderSessionMessages` are unpacked and the received leader session ID is compared to the currently stored leader session ID. If both IDs are the same, the wrapped message is processed. If not, the wrapped message will be discarded. In order to support this behaviour, the PR introduces a new `FlinkActor` for Scala actors and a `FlinkUntypedActor` for Java actors.
Both actors provide a `decorateMessage` method which allows subtypes of the `FlinkActor`/`FlinkUntypedActor` to decorate outgoing messages. Therefore, all implementing classes are supposed to call `decorateMessage` before sending a message to another actor. The `FlinkUntypedActor` already comes with support for message logging and leader session message filtering. Furthermore, its `decorateMessage` implementation checks for each message whether it is a subtype of `RequiresLeaderSessionID` and, if so, wraps the message in a `LeaderSessionMessage`. The receive method of this class will then take care of unwrapping the messages accordingly. In order to have the same behavior with Scala actors, one has to extend the `FlinkActor` and mix in the `LeaderSessionMessages` and `LogMessages` mixins. They effectively do the same as the `FlinkUntypedActor`, but offer better extensibility of the Scala actors in the future. In case a `RequiresLeaderSessionID` message is received by a FlinkActor without being wrapped in a `LeaderSessionMessage`, an exception is thrown, which effectively terminates the execution of the actor. The reason for this is that an unwrapped message might leave the system in an inconsistent state if it's a message from an old leader. Furthermore, since not every `RequiresLeaderSessionID` message requires a response, it is not possible to notify the sender of the offending message about the missing leader session ID. ## ActorGateway refactoring In order to guarantee the same wrapping behaviour when one sends messages from outside of an actor, the former `InstanceGateway` has been refactored to `ActorGateway`, and all `ActorRef` interactions have been replaced by `ActorGateway` instances. Only the web server still uses `ActorRefs`, because it is about to be refactored anyway (see #677). However, the PR #677 should be updated accordingly.
`AkkaActorGateway` implements the `ActorGateway` and makes sure that all `RequiresLeaderSessionID` messages are wrapped correctly in a `LeaderSessionMessage`. In order to make this happen, the `AkkaActorGateway` is given the current leader session ID upon creation. For any actor interaction from outside of an actor, this class should be used. Using this abstraction will allow us to easily extend the decoration behaviour of messages in the future, too. ## Style formatting The PR also contains some Scala style harmonisation of old Scala code. ## TL;DR In order to support leader session IDs all Flink actors should extend `FlinkUntypedActor` or `FlinkActor` with `LeaderSessionMessages` and `LogMessages` mixins. Whenever a message is sent from within an actor, the message should be decorated by calling `decorateMessage`. When messages
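The wrapping and filtering scheme described in the PR can be sketched in plain Java, without Akka. This is a simplified illustration of the pattern, not Flink's actual implementation; in particular, the real `FlinkUntypedActor` throws an exception on unwrapped `RequiresLeaderSessionID` messages, whereas this sketch merely rejects them, and the `CancelJob` message type is invented for the example.

```java
import java.util.UUID;

// Simplified sketch of leader-session message decoration and filtering.
public class LeaderSessionSketch {
    interface RequiresLeaderSessionID {}

    // Envelope carrying the leader session ID alongside the wrapped message.
    static class LeaderSessionMessage {
        final UUID leaderSessionID;
        final Object message;
        LeaderSessionMessage(UUID id, Object msg) { leaderSessionID = id; message = msg; }
    }

    // Hypothetical control message that must only be accepted from the current leader.
    static class CancelJob implements RequiresLeaderSessionID {}

    final UUID currentLeaderSessionID;
    LeaderSessionSketch(UUID id) { currentLeaderSessionID = id; }

    // decorateMessage: transparently wrap RequiresLeaderSessionID messages.
    Object decorateMessage(Object msg) {
        if (msg instanceof RequiresLeaderSessionID) {
            return new LeaderSessionMessage(currentLeaderSessionID, msg);
        }
        return msg;
    }

    // Receiving side: unwrap and drop messages carrying a stale session ID.
    // (The real FlinkUntypedActor escalates unwrapped session messages to an error.)
    boolean accepts(Object msg) {
        if (msg instanceof LeaderSessionMessage) {
            return currentLeaderSessionID.equals(((LeaderSessionMessage) msg).leaderSessionID);
        }
        return !(msg instanceof RequiresLeaderSessionID);
    }

    public static void main(String[] args) {
        UUID leader = UUID.randomUUID();
        LeaderSessionSketch actor = new LeaderSessionSketch(leader);
        Object decorated = actor.decorateMessage(new CancelJob());
        System.out.println(actor.accepts(decorated)); // true: matches current leader
        Object stale = new LeaderSessionMessage(UUID.randomUUID(), new CancelJob());
        System.out.println(actor.accepts(stale));     // false: old leader's session ID
    }
}
```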
[jira] [Commented] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629503#comment-14629503 ] Suminda Dharmasena commented on FLINK-2366: --- You are using it as a library. HA Without ZooKeeper Key: FLINK-2366 URL: https://issues.apache.org/jira/browse/FLINK-2366 Project: Flink Issue Type: Improvement Reporter: Suminda Dharmasena Priority: Minor Please provide a way to do HA without having to use ZK.
[jira] [Commented] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629505#comment-14629505 ] Till Rohrmann commented on FLINK-2366: -- But why can you not do the same with ZK? If you start a DB process next to your Flink cluster, then you can also start a ZK process, right?
[jira] [Commented] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629515#comment-14629515 ] Fabian Hueske commented on FLINK-2366: -- Besides being a distributed data processing engine, Flink offers an embedded mode which executes Flink programs on Java collections. Is this what you refer to? If yes, this mode does not require HA. It runs in the same JVM process as your application and is not meant for large data sets. If that process goes down, you don't need HA anymore. I don't see another option than this mode to use Flink as a library. Except for the embedded collection execution mode, Flink runs as a distributed system for which HA is a very useful feature. As [~senorcarbone] pointed out, HA in a distributed system means leader election, Paxos, etc. ZK provides all of this and reimplementing that in Flink is not an option, IMO.
[jira] [Commented] (FLINK-2365) Review of How to contribute page
[ https://issues.apache.org/jira/browse/FLINK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629540#comment-14629540 ] Enrique Bautista Barahona commented on FLINK-2365: -- Here's the PR: https://github.com/apache/flink-web/pull/3 Review of How to contribute page Key: FLINK-2365 URL: https://issues.apache.org/jira/browse/FLINK-2365 Project: Flink Issue Type: Bug Components: Project Website Reporter: Enrique Bautista Barahona Priority: Minor While reading the [How to contribute page|https://flink.apache.org/how-to-contribute.html] on the website I have noticed some typos, broken links, inconsistent formatting, etc. I plan to submit a PR with some improvements soon.
[jira] [Resolved] (FLINK-2364) Link to Guide is Broken
[ https://issues.apache.org/jira/browse/FLINK-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels resolved FLINK-2364. --- Resolution: Not A Problem Thanks for reporting. Due to the new website's structure, the guide is now located at https://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html. Currently, we cannot use redirects on our documentation infrastructure. Link to Guide is Broken --- Key: FLINK-2364 URL: https://issues.apache.org/jira/browse/FLINK-2364 Project: Flink Issue Type: Bug Reporter: Suminda Dharmasena https://ci.apache.org/projects/flink/flink-docs-master/libs/programming_guide.html
[jira] [Commented] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629494#comment-14629494 ] Suminda Dharmasena commented on FLINK-2366: --- You can have an embedded DB to which state is synchronised. ZK can be a secondary option. If you are using Flink purely embedded, you cannot deploy other software unless your program handles it. This is inconvenient as you cannot take into account all the options.
[jira] [Commented] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629500#comment-14629500 ] Fabian Hueske commented on FLINK-2366: -- What exactly do you mean by using Flink purely embedded? High availability is a feature of distributed systems and I am not sure how that would apply to an embedded system.
[jira] [Commented] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629507#comment-14629507 ] Till Rohrmann commented on FLINK-2366: -- But in the end, you will execute your jobs on a Flink cluster, right? Or what do you do with Flink as a library?
[jira] [Commented] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629514#comment-14629514 ] Suminda Dharmasena commented on FLINK-2366: --- I do not want to start another process. An embedded DB (Derby, H2, etc.) does not run as a separate process; it is embedded within the program itself.
[jira] [Commented] (FLINK-2267) Support multi-class scoring for binary classification scores
[ https://issues.apache.org/jira/browse/FLINK-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629549#comment-14629549 ] Rohit Shinde commented on FLINK-2267: - Can this issue be assigned to me? What do I need to know before I start on this? This will be my first issue. Support multi-class scoring for binary classification scores Key: FLINK-2267 URL: https://issues.apache.org/jira/browse/FLINK-2267 Project: Flink Issue Type: New Feature Components: Machine Learning Library Reporter: Theodore Vasiloudis Priority: Minor Some scores like accuracy, recall and F-score are designed for binary classification. They can be used to evaluate multi-class problems as well by using micro or macro averaging techniques. This ticket is about creating such an option, allowing our binary classification metrics to be used in multi-class problems. You can check out [sklearn's user guide|http://scikit-learn.org/stable/modules/model_evaluation.html#multiclass-and-multilabel-classification] for more info.
[jira] [Commented] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629565#comment-14629565 ] Till Rohrmann commented on FLINK-2366: -- What do you mean exactly by embedded and distributed? If you use Flink's embedded mode, then every node would work independently of the others. There is no way to make the embedded Flink instances work together. If you want to run Flink in a distributed manner, then you have to start a Flink cluster. And then you can also start a ZK on the same nodes, if there is none available. But usually, you want to run these kinds of things on highly reliable nodes and not in YARN containers, for example. At first glance, copycat seems to be usable for HA as well. It offers functionality similar to what we're currently using from ZK. It should not be a problem to implement a new {{LeaderElectionService}}/{{LeaderRetrievalService}} which uses copycat. However, copycat is still under heavy development and not recommended for use in production. And you have the problem that your fault-tolerant value store is running on the same nodes as Flink. These nodes are not necessarily reliable.
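The {{LeaderElectionService}}/{{LeaderRetrievalService}} idea mentioned above can be sketched as a small backend-agnostic interface. The names below are modeled on, but not identical to, Flink's internal interfaces, and the standalone implementation is a trivial single-node stand-in for what a ZK- or copycat-backed implementation would provide.

```java
import java.util.UUID;

// Sketch of a pluggable leader election abstraction: the contender does not
// care whether leadership comes from ZooKeeper, copycat, or a standalone stub.
public class LeaderElectionSketch {
    interface LeaderContender {
        void grantLeadership(UUID leaderSessionID);
    }

    interface LeaderElectionService {
        void start(LeaderContender contender);
    }

    // Trivial single-node implementation: there is no competition,
    // so leadership is granted immediately with a fresh session ID.
    static class StandaloneLeaderElectionService implements LeaderElectionService {
        public void start(LeaderContender contender) {
            contender.grantLeadership(UUID.randomUUID());
        }
    }

    public static void main(String[] args) {
        final UUID[] granted = new UUID[1];
        new StandaloneLeaderElectionService().start(id -> granted[0] = id);
        System.out.println(granted[0] != null); // prints "true"
    }
}
```

A distributed implementation would keep the same interface but grant (and revoke) leadership based on the consensus backend's callbacks.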
[jira] [Commented] (FLINK-2364) Link to Guide is Broken
[ https://issues.apache.org/jira/browse/FLINK-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629508#comment-14629508 ] Suminda Dharmasena commented on FLINK-2364: --- I cannot remember the referring document, but there are many Flink documents which refer to invalid links. Perhaps they have moved.
[jira] [Commented] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629532#comment-14629532 ] Suminda Dharmasena commented on FLINK-2366: --- What I am pushing for is embedded, HA, and distributed. Any of the nodes should be able to discover the other nodes and the cluster. The program is distributed, but each instance embeds Flink as a library. There are DBs which can be used as libraries, like Derby and H2. These can be used. There are also libraries which are not heavyweight and can be embedded, like: https://github.com/kuujo/copycat. Alternatively, you can see how you can embed ZK and use it while the whole of Flink is used as a library in a distributed, HA context.
[jira] [Comment Edited] (FLINK-2300) Links on FAQ page not rendered correctly
[ https://issues.apache.org/jira/browse/FLINK-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14628555#comment-14628555 ] Enrique Bautista Barahona edited comment on FLINK-2300 at 7/16/15 10:38 AM: [~mxm], I've thought about this and maybe it doesn't make much sense to update the links. The plugin lets us hide the base url but, after all: - there aren't that many links of that type, and they aren't that different from a regular link - you have to write the rest of the path to the file or folder, which can and usually will be quite long and messy (e.g. {{flink-java/src/main/java/org/apache/flink/api/java/ExecutionEnvironment.java#L831}}). - there are a lot of other explicit markdown links to the GitHub Apache repo, the Stratosphere repo, the issue tracker, mailing lists, etc.; some of them with hardcoded urls, some of them reading constants from the config file. I think it would be simpler to just not use the plugin at all (remove it, even) and instead write regular markdown links. If you still want to abstract or reuse the urls, it may be better to just use the config file. But for the sake of consistency, maybe it should be done with every link, and I'm not sure it's worth it. What do you think?
Links on FAQ page not rendered correctly Key: FLINK-2300 URL: https://issues.apache.org/jira/browse/FLINK-2300 Project: Flink Issue Type: Bug Components: Project Website Reporter: Robert Metzger Assignee: Maximilian Michels Priority: Minor Labels: starter Attachments: fix_github_plugin.patch On the Flink website, the links using the github plugin are broken. For example {code} {% github README.md master build instructions %} {code} renders to {code} https://github.com/apache/flink/tree/master/README.md {code} See: http://flink.apache.org/faq.html#my-job-fails-early-with-a-javaioeofexception-what-could-be-the-cause I was not able to resolve the issue by using {{a}} tags or markdown links.
[GitHub] flink pull request: [FLINK-2131][ml]: Initialization schemes for k...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/757#issuecomment-121924666 Okay. @tillrohrmann, can you review this?
[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering
[ https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629545#comment-14629545 ] ASF GitHub Bot commented on FLINK-2131: --- Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/757#issuecomment-121924666 Okay. @tillrohrmann, can you review this? Add Initialization schemes for K-means clustering - Key: FLINK-2131 URL: https://issues.apache.org/jira/browse/FLINK-2131 Project: Flink Issue Type: Task Components: Machine Learning Library Reporter: Sachin Goel Assignee: Sachin Goel Lloyd's [KMeans] algorithm takes initial centroids as its input. However, in case the user doesn't provide the initial centers, they may ask for a particular initialization scheme to be followed. The most commonly used are these: 1. Random initialization: Self-explanatory 2. kmeans++ initialization: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf 3. kmeans|| : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf For very large data sets, or for large values of k, the kmeans|| method is preferred as it provides the same approximation guarantees as kmeans++ and requires fewer passes over the input data.
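The kmeans++ seeding referenced above can be sketched in a few lines of plain Java: the first center is drawn uniformly at random, and each subsequent center is drawn with probability proportional to the squared distance to the nearest already-chosen center. This is an illustrative, non-distributed sketch, not the implementation proposed in the PR.

```java
import java.util.*;

public class KMeansPlusPlus {
    // Squared Euclidean distance from p to its nearest center so far.
    static double nearestSquaredDistance(double[] p, List<double[]> centers) {
        double best = Double.MAX_VALUE;
        for (double[] c : centers) {
            double d = 0;
            for (int i = 0; i < p.length; i++) {
                d += (p[i] - c[i]) * (p[i] - c[i]);
            }
            best = Math.min(best, d);
        }
        return best;
    }

    // kmeans++ seeding: pick k initial centers from the input points.
    static List<double[]> seed(List<double[]> points, int k, Random rnd) {
        List<double[]> centers = new ArrayList<>();
        centers.add(points.get(rnd.nextInt(points.size()))); // first center: uniform
        while (centers.size() < k) {
            double[] d2 = new double[points.size()];
            double total = 0;
            for (int i = 0; i < points.size(); i++) {
                d2[i] = nearestSquaredDistance(points.get(i), centers);
                total += d2[i];
            }
            // Sample an index with probability proportional to d2[i].
            double r = rnd.nextDouble() * total;
            int chosen = 0;
            double acc = 0;
            while (chosen < points.size() - 1) {
                acc += d2[chosen];
                if (acc >= r) break;
                chosen++;
            }
            centers.add(points.get(chosen));
        }
        return centers;
    }

    public static void main(String[] args) {
        List<double[]> pts = Arrays.asList(
            new double[]{0, 0}, new double[]{0, 1},
            new double[]{10, 10}, new double[]{10, 11});
        List<double[]> centers = seed(pts, 2, new Random(42));
        System.out.println(centers.size()); // 2
    }
}
```

kmeans|| follows the same distance-weighted idea but oversamples several candidates per pass, which is what makes it attractive for a distributed engine like Flink.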
[GitHub] flink pull request: [FLINK-2131][ml]: Initialization schemes for k...
Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/757#issuecomment-121928501 Will do, once I have some free time. Currently I have to finish the high availability support. I'll try to hurry up. On Thu, Jul 16, 2015 at 12:50 PM, Sachin Goel notificati...@github.com wrote: Okay. @tillrohrmann https://github.com/tillrohrmann, can you review this? -- Reply to this email directly or view it on GitHub https://github.com/apache/flink/pull/757#issuecomment-121924666.
[jira] [Closed] (FLINK-1748) Integrate PageRank implementation into machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann closed FLINK-1748. Resolution: Done As part of Gelly.
[GitHub] flink pull request: [FLINK-2332] [runtime] Adds leader session IDs...
GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/917 [FLINK-2332] [runtime] Adds leader session IDs and registration session IDs ## Registration Session IDs Introduces registration session IDs for all registration messages. Upon receiving a registration message, this ID is checked and, if not correct, the message is discarded. That way, it is possible to distinguish old registration messages which are delayed from valid ones. In the current implementation, a static registration session ID is assigned when the `TaskManager` actor is created. However, with support for high availability, where the leader can change while a component is still trying to register at the old leader, it becomes important to distinguish old from new registration messages. ## Leader Session IDs In order to support high availability, we not only have to distinguish the old from the new registration messages, but also the control messages which are sent to and from the `JobManager` and the `TaskManager`. In order to filter out possibly old messages, this PR introduces a leader session ID which denotes the currently valid messages. However, unlike the registration session ID, leader session IDs are assigned to messages transparently. Messages which extend the `RequiresLeaderSessionID` interface will be wrapped in a `LeaderSessionMessage` which also contains the currently known leader session ID. At the receiving end, the `LeaderSessionMessages` are unpacked and the received leader session ID is compared to the currently stored leader session ID. If both IDs are the same, the wrapped message is processed. If not, the wrapped message will be discarded. In order to support this behaviour, the PR introduces a new `FlinkActor` for Scala actors and a `FlinkUntypedActor` for Java actors. Both actors provide a `decorateMessage` method which allows subtypes of the `FlinkActor`/`FlinkUntypedActor` to decorate outgoing messages.
Therefore, all implementing classes are supposed to call `decorateMessage` before sending a message to another actor. The `FlinkUntypedActor` already comes with support for message logging and leader session message filtering. Furthermore, its `decorateMessage` implementation checks for each message whether it is a subtype of `RequiresLeaderSessionID` and, if so, wraps the message in a `LeaderSessionMessage`. The receive method of this class will then take care of unwrapping the messages accordingly. In order to have the same behavior with Scala actors, one has to extend the `FlinkActor` and mix in the `LeaderSessionMessages` and `LogMessages` mixins. They effectively do the same as the `FlinkUntypedActor`, but offer better extensibility of the Scala actors in the future. In case a `RequiresLeaderSessionID` message is received by a FlinkActor without being wrapped in a `LeaderSessionMessage`, an exception is thrown, which effectively terminates the execution of the actor. The reason for this is that an unwrapped message might leave the system in an inconsistent state if it's a message from an old leader. Furthermore, since not every `RequiresLeaderSessionID` message requires a response, it is not possible to notify the sender of the offending message about the missing leader session ID. ## ActorGateway refactoring In order to guarantee the same wrapping behaviour when one sends messages from outside of an actor, the former `InstanceGateway` has been refactored to `ActorGateway`, and all `ActorRef` interactions have been replaced by `ActorGateway` instances. Only the web server still uses `ActorRefs`, because it is about to be refactored anyway (see #677). However, the PR #677 should be updated accordingly. `AkkaActorGateway` implements the `ActorGateway` and makes sure that all `RequiresLeaderSessionID` messages are wrapped correctly in a `LeaderSessionMessage`.
In order to make this happen, the `AkkaActorGateway` is given the current leader session ID upon creation. For any actor interaction from outside of an actor, this class should be used. Using this abstraction will allow us to easily extend the decoration behaviour of messages in the future, too. ## Style formatting The PR also contains some Scala style harmonisation of old Scala code. ## TL;DR In order to support leader session IDs all Flink actors should extend `FlinkUntypedActor` or `FlinkActor` with `LeaderSessionMessages` and `LogMessages` mixins. Whenever a message is sent from within an actor, the message should be decorated by calling `decorateMessage`. When messages are sent from outside of an actor, the `AkkaActorGateway` should be used instead of directly using `ActorRef`. That way, we guarantee the proper wrapping of the messages. You can merge this pull request into a Git repository by running: $ git
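The wrap-and-filter protocol described in this PR can be sketched in plain Java. The class below is only an illustration of the idea: the names `RequiresLeaderSessionID` and `LeaderSessionMessage` come from the PR description, while everything else (class name, method names, use of `UUID`) is hypothetical and not Flink's actual implementation.

```java
import java.util.UUID;

// Illustrative sketch of the leader session ID protocol; only
// RequiresLeaderSessionID and LeaderSessionMessage are names from the PR,
// the rest is hypothetical and not Flink's actual implementation.
public class LeaderSessionFilter {

    // Marker for messages that must carry the current leader session ID.
    public interface RequiresLeaderSessionID {}

    // Envelope pairing a payload with the sender's known leader session ID.
    public static final class LeaderSessionMessage {
        final UUID leaderSessionId;
        final Object payload;

        public LeaderSessionMessage(UUID leaderSessionId, Object payload) {
            this.leaderSessionId = leaderSessionId;
            this.payload = payload;
        }
    }

    private final UUID currentLeaderSessionId;

    public LeaderSessionFilter(UUID currentLeaderSessionId) {
        this.currentLeaderSessionId = currentLeaderSessionId;
    }

    // Sender side ("decorateMessage"): wrap session-bound payloads,
    // pass all other messages through unchanged.
    public Object decorate(Object message) {
        if (message instanceof RequiresLeaderSessionID) {
            return new LeaderSessionMessage(currentLeaderSessionId, message);
        }
        return message;
    }

    // Receiver side: unwrap, accept only messages from the current leader
    // (returning null for stale ones), and treat an unwrapped session-bound
    // message as a fatal error, mirroring the PR's decision to stop the actor.
    public Object unwrap(Object message) {
        if (message instanceof LeaderSessionMessage) {
            LeaderSessionMessage m = (LeaderSessionMessage) message;
            return m.leaderSessionId.equals(currentLeaderSessionId)
                    ? m.payload
                    : null; // stale leader session: discard
        }
        if (message instanceof RequiresLeaderSessionID) {
            throw new IllegalStateException("unwrapped RequiresLeaderSessionID message");
        }
        return message;
    }
}
```

In the actual PR the wrapping happens transparently inside `decorateMessage` and the actors' receive methods; the sketch only shows why a mismatching session ID leads to a silently dropped message while a missing envelope is treated as an error.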
[jira] [Updated] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann updated FLINK-2366: - Priority: Minor (was: Blocker) HA Without ZooKeeper Key: FLINK-2366 URL: https://issues.apache.org/jira/browse/FLINK-2366 Project: Flink Issue Type: Bug Reporter: Suminda Dharmasena Priority: Minor Please provide a way to do HA without having to use ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629373#comment-14629373 ] Paris Carbone commented on FLINK-2366: -- That's a tough one! ^^ If you mean using something else, we could use a transactional persistent database, which would have the same impact as ZooKeeper (perhaps with better throughput). If, on the other hand, you mean creating something totally custom, that can be pretty tricky, as Till mentioned. We need leader election, 2-phase Paxos commit, etc. These things are already solved by ZooKeeper.
[jira] [Commented] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629596#comment-14629596 ] Ufuk Celebi commented on FLINK-2366: I agree with Till. The use case with embedded mode (collections executions) seems off. But I think there is a nice related use case where we just want to be able to restart a single job manager and continue where we left off. Then we only need the persistent storage part and no distributed consensus.
[jira] [Created] (FLINK-2368) Add a convergence criteria
Sachin Goel created FLINK-2368: -- Summary: Add a convergence criteria Key: FLINK-2368 URL: https://issues.apache.org/jira/browse/FLINK-2368 Project: Flink Issue Type: Bug Components: Machine Learning Library Reporter: Sachin Goel Assignee: Sachin Goel Add a convergence criterion for the Machine Learning library. If an algorithm is iteration-based, it should have a setConvergenceCriteria function which will check for convergence at the end of every iteration, based on a user-defined criterion. The algorithm writer will provide the user with the training data set, the solution before an iteration, and the solution after the iteration; the user will have to return an empty data set in case the training needs to be terminated.
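The proposed hook can be illustrated with a plain-Java sketch. All names below, including `ConvergenceCriterion` and the stand-in update step, are hypothetical and not the eventual Flink ML API; the point is only the shape of the user-supplied check over the solution before and after each iteration.

```java
// Hypothetical sketch of a user-defined convergence hook for an iterative
// trainer; names and the dummy update step are illustrative only.
public class IterativeTrainer {

    // User-supplied check over the solution before and after one iteration.
    public interface ConvergenceCriterion {
        boolean isConverged(double[] before, double[] after);
    }

    // Run up to maxIterations, stopping early when the criterion fires.
    // Returns the number of iterations actually executed.
    public static int train(double[] weights, int maxIterations,
                            ConvergenceCriterion criterion) {
        for (int i = 1; i <= maxIterations; i++) {
            double[] before = weights.clone();
            // A real trainer would run one learning iteration here; this
            // stand-in update simply shrinks over time.
            weights[0] += 1.0 / (i * i);
            if (criterion.isConverged(before, weights)) {
                return i; // converged after i iterations
            }
        }
        return maxIterations;
    }
}
```

A user could then pass, for example, a criterion that stops once the weight change drops below a threshold, which is the kind of check the ticket suggests performing at the end of every iteration.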
[jira] [Updated] (FLINK-2367) "flink-xx-jobmanager-linux-3lsu.log" file can't be automatically recovered/detected after accidental deletion
[ https://issues.apache.org/jira/browse/FLINK-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenliang updated FLINK-2367: - Issue Type: Improvement (was: Bug) "flink-xx-jobmanager-linux-3lsu.log" file can't be automatically recovered/detected after accidental deletion - Key: FLINK-2367 URL: https://issues.apache.org/jira/browse/FLINK-2367 Project: Flink Issue Type: Improvement Components: JobManager Affects Versions: 0.9 Environment: Linux Reporter: chenliang Priority: Minor Labels: reliability Fix For: 0.9.0 To check whether the system is adequately reliable, testers often deliberately perform delete operations. Steps: 1. Go to flink\build-target\log. 2. Delete the "flink-xx-jobmanager-linux-3lsu.log" file. 3. Run jobs that write log output; the system does not report any error even though the log output can no longer be written. 4. When some jobs fail, check the log file for the reason: the log file cannot be found. The Job Manager must be restarted to regenerate the log file before jobs can continue to run.
[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering
[ https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629575#comment-14629575 ] ASF GitHub Bot commented on FLINK-2131: --- Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/757#issuecomment-121928501 Will do, once I have some free time. Currently I have to finish the high availability support. I'll try to hurry up. On Thu, Jul 16, 2015 at 12:50 PM, Sachin Goel notificati...@github.com wrote: Okay. @tillrohrmann https://github.com/tillrohrmann, can you review this? Add Initialization schemes for K-means clustering - Key: FLINK-2131 URL: https://issues.apache.org/jira/browse/FLINK-2131 Project: Flink Issue Type: Task Components: Machine Learning Library Reporter: Sachin Goel Assignee: Sachin Goel Lloyd's [KMeans] algorithm takes initial centroids as its input. However, in case the user doesn't provide the initial centers, they may ask for a particular initialization scheme to be followed. The most commonly used are these: 1. Random initialization: self-explanatory 2. kmeans++ initialization: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf 3. kmeans||: http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf For very large data sets, or for large values of k, the kmeans|| method is preferred as it provides the same approximation guarantees as kmeans++ and requires fewer passes over the input data.
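For reference, the core of k-means++ seeding (scheme 2 above) can be sketched in a few lines of plain Java: each new center is sampled with probability proportional to the squared distance from the nearest center chosen so far. This is a local illustration of the algorithm from the linked paper, not the Flink ML implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Plain-Java sketch of k-means++ seeding (Arthur & Vassilvitskii): sample
// each new center proportionally to its squared distance from the nearest
// already-chosen center. Illustrative only, not Flink ML code.
public class KMeansPlusPlus {

    public static List<double[]> chooseCenters(double[][] points, int k, Random rnd) {
        List<double[]> centers = new ArrayList<>();
        // First center: drawn uniformly at random.
        centers.add(points[rnd.nextInt(points.length)]);
        while (centers.size() < k) {
            double[] d2 = new double[points.length];
            double total = 0.0;
            for (int i = 0; i < points.length; i++) {
                d2[i] = minSquaredDistance(points[i], centers);
                total += d2[i];
            }
            // Sample the next center with probability proportional to d2.
            double r = rnd.nextDouble() * total;
            int next = 0;
            double acc = d2[0];
            while (acc < r && next < points.length - 1) {
                acc += d2[++next];
            }
            centers.add(points[next]);
        }
        return centers;
    }

    static double minSquaredDistance(double[] p, List<double[]> centers) {
        double best = Double.POSITIVE_INFINITY;
        for (double[] c : centers) {
            double s = 0.0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - c[j];
                s += diff * diff;
            }
            best = Math.min(best, s);
        }
        return best;
    }
}
```

The kmeans|| variant mentioned in the ticket follows the same weighting idea but oversamples several centers per pass, which is what makes it attractive for distributed data sets.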
[jira] [Commented] (FLINK-2368) Add a convergence criteria
[ https://issues.apache.org/jira/browse/FLINK-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629757#comment-14629757 ] ASF GitHub Bot commented on FLINK-2368: --- GitHub user sachingoel0101 opened a pull request: https://github.com/apache/flink/pull/918 [FLINK-2368][ml]Adds convergence criteria [WIP] Adds a convergence criteria class which allows the user to decide whether they want to terminate training at any point, based on the solutions before and after an iteration. Adding the first example to the SVM trainer. [Not finished yet]. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachingoel0101/flink convergence Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/918.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #918 commit 656010cd1ef7df5bffafadb6a8be9bd7db715533 Author: Sachin Goel sachingoel0...@gmail.com Date: 2015-07-16T13:50:57Z Initial framework design
[GitHub] flink pull request: [FLINK-2368][ml]Adds convergence criteria [WIP...
Github user sachingoel0101 commented on a diff in the pull request: https://github.com/apache/flink/pull/918#discussion_r34788837 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/classification/SVM.scala --- @@ -382,7 +402,11 @@ object SVM{ // calculate the new weight vector by adding the weight vector updates to the weight // vector value -weights.union(weightedDeltaWs).reduce { _ + _ } +val newWeights = weights.union(weightedDeltaWs).reduce { _ + _ } +if(convergenceDefined){ + converger.get.construct(input,weights,newWeights) --- End diff -- @tillrohrmann , could you take a look here? I am unable to work out the problem here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2368) Add a convergence criteria
[ https://issues.apache.org/jira/browse/FLINK-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629761#comment-14629761 ] ASF GitHub Bot commented on FLINK-2368: --- Github user sachingoel0101 commented on a diff in the pull request: https://github.com/apache/flink/pull/918#discussion_r34788837 @tillrohrmann , could you take a look here? I am unable to work out the problem here.
[GitHub] flink pull request: [FLINK-2368][ml]Adds convergence criteria [WIP...
GitHub user sachingoel0101 opened a pull request: https://github.com/apache/flink/pull/918 [FLINK-2368][ml]Adds convergence criteria [WIP]
[jira] [Created] (FLINK-2369) On Windows, in testFailingSortingDataSinkTask the temp file is not removed
Gabor Gevay created FLINK-2369: -- Summary: On Windows, in testFailingSortingDataSinkTask the temp file is not removed Key: FLINK-2369 URL: https://issues.apache.org/jira/browse/FLINK-2369 Project: Flink Issue Type: Bug Components: Distributed Runtime Environment: Windows 7 64-bit, JDK 8u51 Reporter: Gabor Gevay Assignee: Gabor Gevay Priority: Minor The test fails with the assert at the very end ("Temp output file has not been removed"). This happens because FileOutputFormat.tryCleanupOnError can't delete the file, because it is still open (note that Linux happily deletes open files). A fix would be to have the this.format.close(); call not just in the finally block (in DataSinkTask.invoke), but also before the tryCleanupOnError call (around line 217).
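The close-before-cleanup fix can be demonstrated outside of Flink with plain Java I/O. In the sketch below, `format` stands in for Flink's output format and the method names are illustrative; the essential point is that the writer must be closed before the error-cleanup path tries to delete the file, because Windows refuses to delete a file that is still open.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustration of the fix suggested in the ticket: close the output
// before attempting cleanup-on-error. 'format' stands in for Flink's
// OutputFormat; all names here are illustrative.
public class CleanupOnError {

    // Returns true on success; on failure, closes and deletes the target.
    public static boolean writeOrCleanup(Path target, boolean simulateFailure)
            throws IOException {
        Writer format = new FileWriter(target.toFile());
        try {
            format.write("some output\n");
            if (simulateFailure) {
                throw new IOException("simulated task failure");
            }
            return true;
        } catch (IOException e) {
            // Close FIRST, then delete: deleting an open file fails on
            // Windows, which is exactly the bug described in the ticket.
            format.close();
            Files.deleteIfExists(target);
            return false;
        } finally {
            format.close(); // closing an already-closed FileWriter is a no-op
        }
    }
}
```

On Linux the delete would succeed even without the early close, which is why the test only fails on Windows.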
[jira] [Created] (FLINK-2367) "flink-xx-jobmanager-linux-3lsu.log" file can't be automatically recovered/detected after accidental deletion
chenliang created FLINK-2367: Summary: "flink-xx-jobmanager-linux-3lsu.log" file can't be automatically recovered/detected after accidental deletion Key: FLINK-2367 URL: https://issues.apache.org/jira/browse/FLINK-2367 Project: Flink Issue Type: Bug Components: JobManager Affects Versions: 0.9 Environment: Linux Reporter: chenliang Priority: Minor Fix For: 0.9.0
[jira] [Created] (FLINK-2370) Unify Client Code
Matthias J. Sax created FLINK-2370: -- Summary: Unify Client Code Key: FLINK-2370 URL: https://issues.apache.org/jira/browse/FLINK-2370 Project: Flink Issue Type: Improvement Reporter: Matthias J. Sax Assignee: Matthias J. Sax Right now, client code that sends commands to the JobManager is spread over multiple classes, resulting in code duplication. Furthermore, each client offers a different subset of the available commands. The idea is to use Client.java as a central class that provides all commands the JobManager can handle. All other clients use Client.java internally.
[GitHub] flink pull request: [FLINK-2111] Add stop signal to cleanly shut...
Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/750#issuecomment-122071560 Added fix for https://issues.apache.org/jira/browse/FLINK-2338
[jira] [Updated] (FLINK-2352) [Graph Visualization] Integrate Gelly with Gephi
[ https://issues.apache.org/jira/browse/FLINK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andra Lungu updated FLINK-2352: --- Assignee: Shivani Ghatge [Graph Visualization] Integrate Gelly with Gephi Key: FLINK-2352 URL: https://issues.apache.org/jira/browse/FLINK-2352 Project: Flink Issue Type: New Feature Components: Gelly Affects Versions: 0.10 Reporter: Andra Lungu Assignee: Shivani Ghatge This integration will allow users to see the real-time progress of their graph. They could also visually verify results for clustering algorithms, for example. Gephi is free/open-source and provides support for all types of networks, including dynamic and hierarchical graphs. A first step would be to add the Gephi Toolkit to the pom.xml. https://github.com/gephi/gephi-toolkit Afterwards, a GraphBuilder similar to this one https://github.com/palmerabollo/test-twitter-graph/blob/master/src/main/java/es/guido/twitter/graph/GraphBuilder.java can be implemented.
[jira] [Commented] (FLINK-2361) flatMap + distinct gives erroneous results for big data sets
[ https://issues.apache.org/jira/browse/FLINK-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630360#comment-14630360 ] Andra Lungu commented on FLINK-2361: Hi, I tweaked the code to look like this:

    DataSet<Edge<String, NullValue>> edges = getEdgesDataSet(env);
    DataSet<Vertex<String, Long>> vertices = edges.flatMap(new FlatMapFunction<Edge<String, NullValue>, Vertex<String, Long>>() {
        @Override
        public void flatMap(Edge<String, NullValue> edge, Collector<Vertex<String, Long>> collector) throws Exception {
            collector.collect(new Vertex<String, Long>(edge.getSource(), Long.parseLong(edge.getSource())));
            collector.collect(new Vertex<String, Long>(edge.getTarget(), Long.parseLong(edge.getTarget())));
        }
    }).distinct();

    if (fileOutput) {
        vertices.writeAsCsv(vertexInputPath, "\n", ",");
        env.execute();
    }

    DataSet<Vertex<String, Long>> rereadVertices = env.readCsvFile(vertexInputPath)
        .fieldDelimiter(",").lineDelimiter("\n").ignoreComments("#")
        .types(String.class, Long.class).map(new MapFunction<Tuple2<String, Long>, Vertex<String, Long>>() {
            @Override
            public Vertex<String, Long> map(Tuple2<String, Long> tuple2) throws Exception {
                return new Vertex<String, Long>(tuple2.f0, tuple2.f1);
            }
        });

    Graph<String, Long, NullValue> graph = Graph.fromDataSet(rereadVertices, edges, env);

which is not what I would normally do, BTW... and I still get: Caused by: java.lang.Exception: Target vertex '124518874' does not exist!. 
at org.apache.flink.graph.spargel.VertexCentricIteration$VertexUpdateUdfSimpleVV.coGroup(VertexCentricIteration.java:300)
at org.apache.flink.runtime.operators.CoGroupWithSolutionSetSecondDriver.run(CoGroupWithSolutionSetSecondDriver.java:220)
at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
at org.apache.flink.runtime.iterative.task.AbstractIterativePactTask.run(AbstractIterativePactTask.java:139)
at org.apache.flink.runtime.iterative.task.IterationTailPactTask.run(IterationTailPactTask.java:107)
at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:722)
Lucky day :| At least it's missing a different vertex now... flatMap + distinct gives erroneous results for big data sets Key: FLINK-2361 URL: https://issues.apache.org/jira/browse/FLINK-2361 Project: Flink Issue Type: Bug Components: Gelly Affects Versions: 0.10 Reporter: Andra Lungu When running the simple Connected Components algorithm (currently in Gelly) on the twitter follower graph, with 1, 100 or 1 iterations, I get the following error: Caused by: java.lang.Exception: Target vertex '657282846' does not exist!. 
at org.apache.flink.graph.spargel.VertexCentricIteration$VertexUpdateUdfSimpleVV.coGroup(VertexCentricIteration.java:300)
at org.apache.flink.runtime.operators.CoGroupWithSolutionSetSecondDriver.run(CoGroupWithSolutionSetSecondDriver.java:220)
at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
at org.apache.flink.runtime.iterative.task.AbstractIterativePactTask.run(AbstractIterativePactTask.java:139)
at org.apache.flink.runtime.iterative.task.IterationTailPactTask.run(IterationTailPactTask.java:107)
at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:722)
Now this is very bizarre, as the DataSet of vertices is produced from the DataSet of edges... Which means there cannot be an edge with an invalid target id... The method calls flatMap to isolate the src and trg ids and distinct to ensure their uniqueness. The algorithm works fine for smaller data sets...
[jira] [Commented] (FLINK-2111) Add stop signal to cleanly shutdown streaming jobs
[ https://issues.apache.org/jira/browse/FLINK-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630233#comment-14630233 ] ASF GitHub Bot commented on FLINK-2111: --- Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/750#issuecomment-122071560 Added fix for https://issues.apache.org/jira/browse/FLINK-2338 Add stop signal to cleanly shutdown streaming jobs Key: FLINK-2111 URL: https://issues.apache.org/jira/browse/FLINK-2111 Project: Flink Issue Type: Improvement Components: Distributed Runtime, JobManager, Local Runtime, Streaming, TaskManager, Webfrontend Reporter: Matthias J. Sax Assignee: Matthias J. Sax Priority: Minor Currently, streaming jobs can only be stopped using the cancel command, which is a hard stop with no clean shutdown. The newly introduced stop signal will only affect streaming source tasks, such that the sources can stop emitting data and shut down cleanly, resulting in a clean shutdown of the whole streaming job. This feature is a prerequisite for https://issues.apache.org/jira/browse/FLINK-1929
[GitHub] flink pull request: Update README.md
Github user chenliag613 closed the pull request at: https://github.com/apache/flink/pull/912
[jira] [Commented] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630814#comment-14630814 ] Suminda Dharmasena commented on FLINK-2366: --- In that case, can you add the ability for Flink to start up and configure the ZK instance internally? Flink could then be used as if it were a library, without having to do any other deployments. Also, in this case the JobManager would be started through the API rather than by running a batch file.
[GitHub] flink pull request: Update README.md
Github user chenliag613 commented on the pull request: https://github.com/apache/flink/pull/912#issuecomment-122147385 Thanks @chiwanpark @mxm .
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-122121617 No worries :-) You better use a dedicated branch for your next PR instead of the master branch. I'll go ahead and merge this PR now. Thanks for the contribution!
[jira] [Commented] (FLINK-1963) Improve distinct() transformation
[ https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630455#comment-14630455 ] ASF GitHub Bot commented on FLINK-1963: --- Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-122121617 No worries :-) You better use a dedicated branch for your next PR instead of the master branch. I'll go ahead and merge this PR now. Thanks for the contribution! Improve distinct() transformation - Key: FLINK-1963 URL: https://issues.apache.org/jira/browse/FLINK-1963 Project: Flink Issue Type: Improvement Components: Java API, Scala API Affects Versions: 0.9 Reporter: Fabian Hueske Assignee: pietro pinoli Priority: Minor Labels: starter Fix For: 0.9 The `distinct()` transformation is a bit limited right now with respect to processing atomic key types: - `distinct(String ...)` works only for composite data types (POJO, tuple), but wildcard expression should also be supported for atomic key types - `distinct()` only works for composite types, but should also work for atomic key types - `distinct(KeySelector)` is the most generic one, but not very handy to use - `distinct(int ...)` works only for Tuple data types (which is fine) Fixing this should be rather easy.
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-122132484 Merged, thanks for the contribution @pp86 !
[jira] [Commented] (FLINK-1963) Improve distinct() transformation
[ https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630553#comment-14630553 ] ASF GitHub Bot commented on FLINK-1963: --- Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-122132484 Merged, thanks for the contribution @pp86 !
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/905
[jira] [Commented] (FLINK-1963) Improve distinct() transformation
[ https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630551#comment-14630551 ] ASF GitHub Bot commented on FLINK-1963: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/905
[jira] [Commented] (FLINK-1963) Improve distinct() transformation
[ https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630421#comment-14630421 ] ASF GitHub Bot commented on FLINK-1963: --- Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-122117624 Hi @pp86, are you still working on this PR? I would suggest to open a new PR for the documentation (FLINK-2362).
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user hsaputra commented on a diff in the pull request: https://github.com/apache/flink/pull/905#discussion_r34847588 --- Diff: flink-tests/src/test/java/org/apache/flink/test/javaApiOperators/DistinctITCase.java --- @@ -274,4 +277,45 @@ public Integer map(POJO value) throws Exception { return (int) value.nestedPojo.longNumber; } }
+
+	@Test
+	public void testCorrectnessOfDistinctOnAtomic() throws Exception {
+		/*
+		 * check correctness of distinct on Integers
+		 */
+
+		final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
+		DataSet<Integer> ds = CollectionDataSets.getIntegerDataSet(env);
+		DataSet<Integer> reduceDs = ds.distinct();
+
+		List<Integer> result = reduceDs.collect();
+
+		String expected = "1\n2\n3\n4\n5";
+
+		compareResultAsText(result, expected);
+	}
+
+	@Test
+	public void testCorrectnessOfDistinctOnAtomicWithSelectAllChar() throws Exception {
+		/*
--- End diff -- Looks like the comment block is misaligned. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-122131578

Good catch @hsaputra :-) I'll fix it before merging
[jira] [Commented] (FLINK-1963) Improve distinct() transformation
[ https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630540#comment-14630540 ]

ASF GitHub Bot commented on FLINK-1963:
---
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-122131578

Good catch @hsaputra :-) I'll fix it before merging
[jira] [Commented] (FLINK-685) Add support for semi-joins
[ https://issues.apache.org/jira/browse/FLINK-685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630528#comment-14630528 ]

pietro pinoli commented on FLINK-685:
---
Dear [~fhueske], are you (the community) still interested in this? I am developing a genomics project that relies on Flink, and having such a feature would help us a lot. I would love to take it, but I guess I'll need some help from you (or someone else) to get started :) Let me know. Thanks, Pietro.

Add support for semi-joins
--------------------------
Key: FLINK-685
URL: https://issues.apache.org/jira/browse/FLINK-685
Project: Flink
Issue Type: New Feature
Reporter: GitHub Import
Priority: Minor
Labels: github-import
Fix For: pre-apache

A semi-join is basically a join filter. One input is filtering and the other one is filtered. A tuple of the filtered input is emitted exactly once if the filtering input has one (or more) tuples with matching join keys. That means that the output of a semi-join has the same type as the filtered input, and the filtering input is completely discarded. In order to support a semi-join, we need to add an additional physical execution strategy that ensures that a tuple of the filtered input is emitted only once even if the filtering input has more than one tuple with matching keys. Furthermore, a couple of optimizations compared to standard joins are possible, such as storing only the keys, and not the full tuples, of the filtering input in a hash table.

Imported from GitHub
Url: https://github.com/stratosphere/stratosphere/issues/685
Created by: [fhueske|https://github.com/fhueske]
Labels: enhancement, java api, runtime
Milestone: Release 0.6 (unplanned)
Created at: Mon Apr 14 12:05:29 CEST 2014
State: open

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
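[Editor's illustration, not part of the original thread.] The semi-join semantics described above — emit each tuple of the filtered input at most once when the filtering input has a matching key, keeping only the filtering side's keys in memory — can be sketched in plain Java (not the proposed Flink operator; names are illustrative):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

public class SemiJoinSketch {

    // Semi-join: keep a tuple of `filtered` iff its key appears in
    // `filteringKeys`. Output type equals the filtered input's type;
    // the filtering input contributes nothing but its key set, which
    // is the storage optimization noted in the ticket.
    public static <T, K> List<T> semiJoin(List<T> filtered,
                                          Set<K> filteringKeys,
                                          Function<T, K> keyOf) {
        return filtered.stream()
                .filter(t -> filteringKeys.contains(keyOf.apply(t)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> orders = Arrays.asList("a-1", "b-2", "a-3");
        // Build the key set of the filtering input; duplicates collapse here,
        // which is why each filtered tuple is emitted at most once.
        Set<String> activeCustomers = new HashSet<>(Arrays.asList("a", "a", "c"));
        System.out.println(semiJoin(orders, activeCustomers,
                s -> s.substring(0, 1))); // [a-1, a-3]
    }
}
```

Collapsing the filtering side into a key set is exactly what distinguishes this from a standard join followed by projection, which would duplicate output tuples when the filtering side has repeated keys.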
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user pp86 commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-122118573

Hi @fhueske, I'm very sorry, this is the first time I've worked with GitHub and I'm making a bit of a mess :| But I think I finally succeeded in removing the unwanted documentation commits. I am not working on the code any more. Thanks
[jira] [Commented] (FLINK-1963) Improve distinct() transformation
[ https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630431#comment-14630431 ]

ASF GitHub Bot commented on FLINK-1963:
---
Github user pp86 commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-122118573

Hi @fhueske, I'm very sorry, this is the first time I've worked with GitHub and I'm making a bit of a mess :| But I think I finally succeeded in removing the unwanted documentation commits. I am not working on the code any more. Thanks