[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-122117624 Hi @pp86, are you still working on this PR? I would suggest to open a new PR for the documentation (FLINK-2362). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-122121617 No worries :-) You better use a dedicated branch for your next PR instead of the master branch. I'll go ahead and merge this PR now. Thanks for the contribution! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-122132484 Merged, thanks for the contribution @pp86 ! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/905 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user hsaputra commented on a diff in the pull request: https://github.com/apache/flink/pull/905#discussion_r34847588 --- Diff: flink-tests/src/test/java/org/apache/flink/test/javaApiOperators/DistinctITCase.java --- @@ -274,4 +277,45 @@ public Integer map(POJO value) throws Exception { return (int) value.nestedPojo.longNumber; } } + + @Test + public void testCorrectnessOfDistinctOnAtomic() throws Exception { + /* + * check correctness of distinct on Integers + */ + + final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); + DataSetInteger ds = CollectionDataSets.getIntegerDataSet(env); + DataSetInteger reduceDs = ds.distinct(); + + ListInteger result = reduceDs.collect(); + + String expected = 1\n2\n3\n4\n5; + + compareResultAsText(result, expected); + } + + @Test + public void testCorrectnessOfDistinctOnAtomicWithSelectAllChar() throws Exception { + /* --- End diff -- Looks like misaligned for the comment block. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-122131578 Good catch @hsaputra :-) I'll fix it before merging --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user pp86 commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-122118573 Hi @fhueske , I'm very sorry, this is the first time I work with github, I am doing some mess :| But, I think I finally succeeded in removing the unwanted commits for documentation. I am not working on the code any more. Thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user pp86 commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-121739943 Dear @fhueske , I hope I satisfied your requests! Let me know if some other changes are needed. As you suggested I took the issue on documentation. Thanks, Pietro --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-121775689 Looks good to me, thanks! Will try the changes tomorrow and merge. Thanks for helping with the documentation! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user pp86 commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-121192327 hello @chiwanpark! what do you think about the tests? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user pp86 commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-121187436 sure thing! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-121286738 Hi, thanks for the pull request. I noticed that the `CollectionDataSets.getStringDataSet` data set does not contain any duplicates. Therefore, it is not well suited to test the distinct operator. There also a few more tests for the Distinct operator that need to be adapted: - DistinctITCase.java - DistinctOperatorTest.java - DistinctOperatorTest.scala The Unit tests (*Test) check if correct arguments are accepted and invalid arguments are rejected. It would be nice, if you could try to do a distinct() on a DataSetYourObject, where YourObject is not a Pojo but a generic type which does not implement `Comparable`, i.e., it is not a Key type. Thank you, Fabian --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-121283389 @pp86 Looks good to merge. :) If another committer gives LGTM to this PR, I'll merge this PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user pp86 commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-121406113 Hi guys, I added some tests. I also added a test on a generic data type which does not implement Comparable. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user fhueske commented on a diff in the pull request: https://github.com/apache/flink/pull/905#discussion_r34626753 --- Diff: flink-java/src/main/java/org/apache/flink/api/java/DataSet.java --- @@ -618,9 +620,9 @@ public long count() throws Exception { } /** -* Returns a distinct set of a {@link Tuple} {@link DataSet} using all fields of the tuple. +* Returns a distinct set of a {@link DataSet} using all fields of the tuple. * p -* Note: This operator can only be applied to Tuple DataSets. +* If the input is a {@link Tuple} {@link DataSet}, uses all fields of the tuple. --- End diff -- Can you change this to If the input is a composite type (Tuple or Pojo type), distinct is performed on all fields and each field must be a key type. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user fhueske commented on a diff in the pull request: https://github.com/apache/flink/pull/905#discussion_r34626780 --- Diff: flink-java/src/main/java/org/apache/flink/api/java/DataSet.java --- @@ -618,9 +620,9 @@ public long count() throws Exception { } /** -* Returns a distinct set of a {@link Tuple} {@link DataSet} using all fields of the tuple. +* Returns a distinct set of a {@link DataSet} using all fields of the tuple. --- End diff -- Can you remove the using all fields of the tuple part? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user fhueske commented on a diff in the pull request: https://github.com/apache/flink/pull/905#discussion_r34627102 --- Diff: flink-tests/src/test/java/org/apache/flink/test/javaApiOperators/DistinctITCase.java --- @@ -274,4 +277,81 @@ public Integer map(POJO value) throws Exception { return (int) value.nestedPojo.longNumber; } } + + @Test + public void testCorrectnessOfDistinctOnAtomic() throws Exception { + /* + * check correctness of distinct on Integers + */ + + final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); + DataSetInteger ds = CollectionDataSets.getIntegerDataSet(env); + DataSetInteger reduceDs = ds.distinct(); + + ListInteger result = reduceDs.collect(); + + String expected = 1\n2\n3\n4\n5; + + compareResultAsText(result, expected); + } + + @Test + public void testCorrectnessOfDistinctOnAtomicWithSelectAllChar() throws Exception { + /* + * check correctness of distinct on Strings, using Keys.ExpressionKeys.SELECT_ALL_CHAR + */ + + final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); + DataSetString ds = CollectionDataSets.getStringDataSet(env); + DataSetString reduceDs = ds.union(ds).distinct(*); + + ListString result = reduceDs.collect(); + + String expected = I am fine.\n + + Luke Skywalker\n + + LOL\n + + Hello world, how are you?\n + + Hi\n + + Hello world\n + + Hello\n + + Random comment\n; + + compareResultAsText(result, expected); + } + + @Test(expected = InvalidProgramException.class) + public void testDistinctOnNotKeyDataType() throws Exception { --- End diff -- We usually add such failing tests to the unit tests (DistinctOperatorTest.java) instead of integration tests. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
GitHub user pp86 opened a pull request: https://github.com/apache/flink/pull/905 [FLINK-1963] Improve distinct() transformation - add the possibility to use distinct() on Atomic data types - add the possibility to use distinct(_)/distinct(*) on Atomic data types You can merge this pull request into a Git repository by running: $ git pull https://github.com/pp86/flink master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/905.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #905 commit f9a69111c1050d45d14ffb332411af6663482dd5 Author: pietro pinoli pie...@pietros-mbp.lan Date: 2015-07-13T11:32:20Z [FLINK-1963] Improve distinct() transformation - add the possibility to use distinct() on Atomic data types - add the possibility to use distinct(_)/distinct(*) on Atomic data types commit da699987852d7481843d8b8b60c976d3d8e61a69 Author: pietro pinoli pie...@pietros-mbp.lan Date: 2015-07-13T11:58:28Z [FLINK-1963] Improve distinct() transformation ⦠- add the possibility to use distinct() on Atomic data types - add the possibility to use distinct(_)/distinct(*) on Atomic data types commit 911db2d2efcca227d1fdcf595cb6c5a2cd4e92ce Author: pietro pinoli pie...@pietros-mbp.lan Date: 2015-07-13T12:01:20Z Clean code --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-120913343 Hi, @pp86 Thanks for your contribution. But I think that using `AutoSelector` is not the best approach to improve distinct transformation. In Flink, a `KeySelector` converts a `DataSetO` to `DataSetTuple2K, O` and uses the first element of the tuple as key. For atomic types, `AutoSelector` creates `DataSetTuple2V, V` which unnecessarily duplicated data. I recommend `Keys.ExpressionKeys` when the user call `distinct()` method on atomic data types. And It would be better to add the test cases for this changes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-121119196 @pp86 It seems okay but we need to check this change with some test cases. Could you add some test cases into `DistinctITCase` in `flink-tests` module? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-120922707 @pp86 Hi you can modify this pull request by adding commit in your branch. (pp86:master) I think reopening this pull request and adding commit is better than opening new pull request. :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user pp86 commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-120922481 Dear Chiwan, thanks for you suggestions. I'll integrate them and open a new pull request. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user pp86 closed the pull request at: https://github.com/apache/flink/pull/905 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---