[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37138063

--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,108 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: <a href="../ml">FlinkML</a> - Statistics
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+The statistics utility provides features such as building histograms over data, determining
+the mean, variance, gini impurity, entropy etc. of data.
+
+## Methods
+
+The Statistics utility provides two major functions: `createHistogram` and `dataStats`.
+
+### Creating a histogram
+
+There are two types of histograms:
+<ul>
+<li>
+<strong>Continuous Histograms</strong>: These histograms are formed on a data set `X: DataSet[Double]`
+when the values in `X` are from a continuous range. These histograms support
+`quantile` and `sum` operations. Here `quantile(q)` refers to a value $x_q$ such that $|x: x
+\leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of elements $x \leq s$, which can
+be construed as a cumulative probability value at $s$ [of course, a <i>scaled</i> probability].
+<br>
+A continuous histogram can be formed by calling `X.createHistogram(b)` where `b` is the
+number of bins.
+</li>
+<li>
+<strong>Categorical Histograms</strong>: These histograms are formed on a data set `X: DataSet[Double]`
+when the values in `X` are from a discrete distribution. These histograms
+support the `count(c)` operation, which returns the number of elements associated with cateogry `c`.
+<br>
+A categorical histogram can be formed by calling `X.createHistogram(0)`.
+</li>
+</ul>
+
+### Data Statistics
+
+The `dataStats` function operates on a data set `X: DataSet[Vector]` and returns column-wise
+statistics for `X`. Every field of `X` is allowed to be defined as either <i>discrete</i> or
+<i>continuous</i>.
+<br>
+Statistics can be evaluated by calling `DataStats.dataStats(X)` or
+`DataStats.dataStats(X, discreteFields)`. The latter is used when some fields need to be
+declared discrete-valued, and is provided as an array of indices of the fields which are discrete.
+<br>
+The following information is available as part of `DataStats`:
+<ul>
+<li>Number of elements in `X`</li>
+<li>Dimension of `X`</li>
+<li>Column-wise statistics where for discrete fields, we report counts for each category, and
+the Gini impurity and Entropy of the field, while for continuous fields, we report the
+minimum, maximum, mean and variance.
+</li>
+</ul>
+
+## Examples
+
+{% highlight scala %}
+
+import org.apache.flink.ml.statistics._
+import org.apache.flink.ml._
+
+val X: DataSet[Double] = ...
+
+// Create continuous histogram
+val histogram = X.createHistogram(5) // creates a histogram with five bins
+histogram.quantile(0.3)              // returns the 30th quantile
+histogram.sum(4)                     // returns number of elements less than 4
+
+// Create categorical histogram
+val histogram = X.createHistogram(0) // creates a categorical histogram
+histogram.count(3)                   // number of elements with cateogory value 3

--- End diff --

cateogory -> category
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/861#discussion_r37138093

--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,108 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: <a href="../ml">FlinkML</a> - Statistics
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+The statistics utility provides features such as building histograms over data, determining
+the mean, variance, gini impurity, entropy etc. of data.
+
+## Methods
+
+The Statistics utility provides two major functions: `createHistogram` and `dataStats`.
+
+### Creating a histogram
+
+There are two types of histograms:
+<ul>
+<li>
+<strong>Continuous Histograms</strong>: These histograms are formed on a data set `X: DataSet[Double]`
+when the values in `X` are from a continuous range. These histograms support
+`quantile` and `sum` operations. Here `quantile(q)` refers to a value $x_q$ such that $|x: x
+\leq x_q| = q * |X|$. Further, `sum(s)` refers to the number of elements $x \leq s$, which can
+be construed as a cumulative probability value at $s$ [of course, a <i>scaled</i> probability].
+<br>
+A continuous histogram can be formed by calling `X.createHistogram(b)` where `b` is the
+number of bins.
+</li>
+<li>
+<strong>Categorical Histograms</strong>: These histograms are formed on a data set `X: DataSet[Double]`
+when the values in `X` are from a discrete distribution. These histograms
+support the `count(c)` operation, which returns the number of elements associated with cateogry `c`.
+<br>
+A categorical histogram can be formed by calling `X.createHistogram(0)`.
+</li>
+</ul>
+
+### Data Statistics
+
+The `dataStats` function operates on a data set `X: DataSet[Vector]` and returns column-wise
+statistics for `X`. Every field of `X` is allowed to be defined as either <i>discrete</i> or
+<i>continuous</i>.
+<br>
+Statistics can be evaluated by calling `DataStats.dataStats(X)` or
+`DataStats.dataStats(X, discreteFields)`. The latter is used when some fields need to be
+declared discrete-valued, and is provided as an array of indices of the fields which are discrete.
+<br>
+The following information is available as part of `DataStats`:
+<ul>
+<li>Number of elements in `X`</li>
+<li>Dimension of `X`</li>
+<li>Column-wise statistics where for discrete fields, we report counts for each category, and
+the Gini impurity and Entropy of the field, while for continuous fields, we report the
+minimum, maximum, mean and variance.
+</li>
+</ul>
+
+## Examples
+
+{% highlight scala %}
+
+import org.apache.flink.ml.statistics._
+import org.apache.flink.ml._
+
+val X: DataSet[Double] = ...
+
+// Create continuous histogram
+val histogram = X.createHistogram(5) // creates a histogram with five bins
+histogram.quantile(0.3)              // returns the 30th quantile
+histogram.sum(4)                     // returns number of elements less than 4

--- End diff --

Is this example valid? `RichDoubleDataSet.createHistogram` returns `DataSet[OnlineHistogram]`, and it doesn't have methods such as `quantile`, `sum`, etc.
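For illustration, a minimal sketch of how the documented calls could be made valid under that return type — assuming the result holds a single `OnlineHistogram` that can be materialized with `collect()`, and that `OnlineHistogram` itself defines `quantile` and `sum` (hypothetical usage, not the PR's actual API):

```scala
import org.apache.flink.ml.statistics._
import org.apache.flink.ml._

// Hypothetical: createHistogram yields a single-element DataSet[OnlineHistogram],
// so the histogram must be materialized before quantile/sum can be called.
val histogramDS: DataSet[OnlineHistogram] = X.createHistogram(5)
val histogram: OnlineHistogram = histogramDS.collect().head

histogram.quantile(0.3) // the documented 30% quantile
histogram.sum(4.0)      // the documented count of elements <= 4
```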
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-131425060

Hi @sachingoel0101, thanks for your contribution. I reviewed this PR and commented on the source code. There are some problems which aren't covered by the inline comments. In the documentation, there are many lines containing `<a>`, `<ul>`, `<li>`, `<br>`, `<strong>` and `<i>` tags. These can be replaced with markdown syntax. Also, some code (`MLUtils.createHistogram`, `ColumnStatistics`) is not formatted properly. I haven't finished reviewing the histogram-building process; after reading the full source code and the given paper, I'll add comments about it.
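For example, the HTML markup in the docs could be replaced with markdown equivalents along these lines (illustrative, using a line from the statistics page):

```markdown
<!-- current HTML form -->
<ul>
<li><strong>Continuous Histograms</strong>: These histograms are formed on a continuous range ...</li>
</ul>

<!-- markdown equivalent -->
* **Continuous Histograms**: These histograms are formed on a continuous range ...
```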
[GitHub] flink pull request: [FLINK-1745] [ml] [WIP] Add exact k-nearest-ne...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/696#issuecomment-131086738

@kno10 Thanks for the comment. In this case we don't need to parallelize the R-Tree, because it is only used in the reducer for matching records of the given block pair. But I agree that the exact k-NN implementation doesn't fit large-scale data. We can discuss this in another [JIRA issue](https://issues.apache.org/jira/browse/FLINK-1934). I'm inclined to close this PR and FLINK-1745, because many people say exact k-NN is not good, and I agree. May I close them? @thvasilo @tillrohrmann
[GitHub] flink pull request: [FLINK-2507]Rename the function tansformAndEmi...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/1007#issuecomment-129850783

Why were the file permissions changed from 644 to 755? The other changes seem good.
[GitHub] flink pull request: [Flink-2030][ml]Data Set Statistics and Histog...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/861#issuecomment-129336972

Hi, I just discovered the review request. I'll review this PR soon. Because I'm busy working on my graduation thesis, I can probably start reviewing on the weekend.
[GitHub] flink pull request: [FLINK-1962] Add Gelly Scala API
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/808#issuecomment-128638172

[Your last commit](https://github.com/PieterJanVanAeken/flink/commit/463a1f30b3b0785f46c76f9c290da3deec26) includes all updates from the master branch. Please remove the last commit (`git reset --hard HEAD~1` on your master branch), rebase your branch on the master branch (`git rebase apache/master`), and push to your branch with a force update (`git push origin +master`); the full sequence is sketched below. Additionally, working on a non-master branch is better than working on the master branch. If you work on a non-master branch, you can keep your master branch in sync with the apache master branch while you are changing your branch.
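Putting the steps together, the recovery sequence would look roughly like this (assuming a remote named `apache` points at apache/flink and `origin` at the fork):

```bash
# drop the merge-like commit from the local master branch
git checkout master
git reset --hard HEAD~1

# replay the remaining work on top of the current apache master
git fetch apache
git rebase apache/master

# force-push the rewritten history to the fork
git push origin +master
```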
[GitHub] flink pull request: [FLINK-2314]
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/997#discussion_r36478648

--- Diff: flink-staging/flink-streaming/flink-streaming-core/src/main/java/org/apache/flink/streaming/api/functions/source/FileSourceFunction.java ---
@@ -119,13 +124,20 @@ public void run(SourceContext<OUT> ctx) throws Exception {
 		while (isRunning) {
 			OUT nextElement = serializer.createInstance();
 			nextElement = format.nextRecord(nextElement);
-			if (nextElement == null && splitIterator.hasNext()) {
+			if (nextElement == null && splitIterator.hasNext() ) {

--- End diff --

unnecessary space
[GitHub] flink pull request: FLINK-2166. Add fromCsvFile() method to TableE...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/939#issuecomment-126621482

Just my opinion: `TableEnvironment` is located under `org.apache.flink.api.java.table` because of the unification of the Table API implementation. But the Table API is implemented in Scala, so I think using the Scala API is more appropriate here. @aljoscha @fhueske What do you think about this?
[GitHub] flink pull request: [FLINK-2446]Fix SocketTextStreamFunction has m...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/965#issuecomment-12669

Nice catch! Looks good to merge.
[GitHub] flink pull request: [FLINK-2408] Define all maven properties outsi...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/941#issuecomment-125185305

Hi, the pom.xml file of the flink-scala module also has a profile-based dependency setting. Is it okay to leave it unmodified?
[GitHub] flink pull request: [FLINK-2200] Add Flink with Scala 2.11 in Mave...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/885#issuecomment-125271970

There is a wrong description in the build documentation. Because the Scala 2.11 profile activation is determined by the property `scala-2.11`, the Scala 2.10 profile is activated when we execute `mvn -Pscala-2.11`. The following command is right:

```
mvn clean install -DskipTests -Dscala-2.11
```

About the shading artifacts, your guess is right. Because the Hadoop packages don't need Scala dependencies, I didn't add the suffix to them. But if we need the suffix for them to maintain uniformity, we can add it. What do you think?

I just found another problem. I opened the dependency-reduced pom in my maven repository, and there are some property expressions in the artifact id. For example, the following is the head of the flink-runtime module's pom.xml:

```
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <parent>
    <artifactId>flink-parent${scala.suffix}</artifactId>
    <groupId>org.apache.flink</groupId>
    <version>0.10-SNAPSHOT</version>
    <relativePath>../../pom.xml</relativePath>
  </parent>
  <modelVersion>4.0.0</modelVersion>
  <artifactId>flink-runtime${scala.suffix}</artifactId>
  <name>flink-runtime</name>
  <build>
    <plugins>
      <plugin>
```

As you see, there are property expressions (`${scala.suffix}`) in the artifactId. I think that it can be a problem. How can I solve this?
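For reference, the property-based activation described above would look roughly like this in the root pom.xml (a sketch, not the exact Flink pom; the Scala version value is illustrative):

```xml
<profile>
  <id>scala-2.11</id>
  <activation>
    <property>
      <!-- activated by -Dscala-2.11, independent of -P profile selection -->
      <name>scala-2.11</name>
    </property>
  </activation>
  <properties>
    <scala.version>2.11.6</scala.version>
    <scala.suffix>_2.11</scala.suffix>
  </properties>
</profile>
```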
[GitHub] flink pull request: [FLINK-2200] Add Flink with Scala 2.11 in Mave...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/885#issuecomment-125319617

When we upgrade the version of maven-shade-plugin to 2.4.1, the property in the current project's artifactId is interpolated properly, but the property in the parent artifactId is not. I think this is a bug in maven-shade-plugin. I posted an [issue](https://issues.apache.org/jira/browse/MSHADE-200) to the maven-shade-plugin JIRA.
[GitHub] flink pull request: [FLINK-2200] Add Flink with Scala 2.11 in Mave...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/885#issuecomment-124521433

Could you post the command you use to compile Flink with Scala 2.11? The current setting works well in my environment. Maven module definitions are not artifact ids but directory names, so we should keep the current setting (see the sketch below). I'm adding the suffix to the poms, except for quickstart.
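To illustrate the distinction: `<module>` entries in the parent pom refer to directories and therefore stay unchanged, while only the `artifactId` carries the Scala suffix (sketch, module names abbreviated):

```xml
<!-- parent pom: module paths are directory names, never artifact ids -->
<modules>
  <module>flink-core</module>
  <module>flink-java</module>
  <module>flink-scala</module>
</modules>

<!-- child pom: the suffix appears only in the artifact id -->
<artifactId>flink-scala${scala.suffix}</artifactId>
```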
[GitHub] flink pull request: [FLINK-2200] Add Flink with Scala 2.11 in Mave...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/885#issuecomment-123642852

Hi, currently this PR is not ready to merge, because it doesn't contain the changes for #677. I'll update it soon. Unfortunately, I'm out right now; maybe I can update this PR in 4-5 hours.
[GitHub] flink pull request: [FLINK-2377] Add reader.close() to readAllResu...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/924#issuecomment-122893574

Great catch! Looks good to merge.
[GitHub] flink pull request: [FLINK-2105] Implement Sort-Merge Outer Join a...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/907#issuecomment-122376238

Hi, I am reviewing these changes. I'm not done yet, but I found some points which could be improved. First, there are some duplicated classes such as `SimpleFlatJoinFunction`, `MatchRemovingMatcher`, `Match`, `CollectionIterator`. I think we can move these classes under the `org.apache.flink.runtime.operators.testutils` package; after moving them, they can be shared with the test cases for the hash-based outer join. Second, this is just my opinion, but how about creating iterator classes for each outer join type, such as `AbstractMergeLeftOuterJoinIterator`, `AbstractMergeRightOuterJoinIterator`, `AbstractMergeFullOuterJoinIterator`, deriving the classes while reusing the shared variables (a sketch follows below)? I'm concerned about the time consumed by comparing the outer join type for many records in the `callWithNextKey` method, since the outer join type is already decided before the join operation runs. But I'm not sure that there is an obvious performance decrease from this comparison; if the decrease is negligible, the second suggestion can be ignored.
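A sketch of what the second suggestion could look like — the join type is fixed by choosing a subclass at construction time, so `callWithNextKey` carries no per-record branch (class and method names follow the comment; the bodies are illustrative only):

```java
import org.apache.flink.api.common.functions.FlatJoinFunction;
import org.apache.flink.util.Collector;

// Hypothetical hierarchy: one subclass per outer join type.
abstract class AbstractMergeOuterJoinIterator<T1, T2, O> {
	// Advances to the next key and joins its records; returns false when exhausted.
	abstract boolean callWithNextKey(FlatJoinFunction<T1, T2, O> joinFunction,
			Collector<O> collector) throws Exception;
}

class AbstractMergeLeftOuterJoinIterator<T1, T2, O>
		extends AbstractMergeOuterJoinIterator<T1, T2, O> {
	@Override
	boolean callWithNextKey(FlatJoinFunction<T1, T2, O> joinFunction,
			Collector<O> collector) throws Exception {
		// Left-outer-specific merge logic only: emit (left, null) when the
		// current left key has no matching right records. No join-type check here.
		return false; // placeholder: no more keys
	}
}
```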
[GitHub] flink pull request: Update README.md
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/912#issuecomment-121515483

Hi, thanks for your contribution. But I think this PR is not necessary, because the change is not specific to Flink but about git in general.
[GitHub] flink pull request: Updated method documentation in joinDataSet.sc...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/909#issuecomment-121246257

Looks good. :) +1 for merging.
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-121283389

@pp86 Looks good to merge. :) If another committer gives a LGTM to this PR, I'll merge it.
[GitHub] flink pull request: [FLINK-297] [web frontend] First part of JobMa...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/677#issuecomment-121441621

Looks good to merge. After merging this PR, we need to modify PR #885.
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-120913343

Hi @pp86, thanks for your contribution. But I think that using `AutoSelector` is not the best approach to improve the distinct transformation. In Flink, a `KeySelector` converts a `DataSet<O>` to a `DataSet<Tuple2<K, O>>` and uses the first element of the tuple as the key. For atomic types, `AutoSelector` creates a `DataSet<Tuple2<V, V>>`, which unnecessarily duplicates the data. I recommend `Keys.ExpressionKeys` when the user calls the `distinct()` method on atomic data types (see the sketch below). And it would be better to add test cases for these changes.
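To make the duplication concern concrete, a minimal sketch — assuming expression-key support for atomic types, which is what this change would enable (illustrative, not the PR's code):

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class DistinctAtomicExample {
	public static void main(String[] args) throws Exception {
		ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
		DataSet<Integer> input = env.fromElements(1, 2, 2, 3);

		// An identity KeySelector would wrap every element as Tuple2<V, V>,
		// shipping each value twice. Expression keys let the runtime use the
		// atomic value itself as the key, with no extra copy:
		DataSet<Integer> distinct = input.distinct("*"); // "*" selects the full type

		distinct.print();
	}
}
```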
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-121119196

@pp86 It seems okay, but we need to check this change with some test cases. Could you add some test cases to `DistinctITCase` in the `flink-tests` module?
[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/905#issuecomment-120922707

@pp86 Hi, you can modify this pull request by adding commits to your branch (pp86:master). I think reopening this pull request and adding commits is better than opening a new pull request. :)
[GitHub] flink pull request: [FLINK-2218] Web client cannot distinguesh bet...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/904#issuecomment-120707231

+1 for renaming. I was confused by the difference between the options and arguments.
[GitHub] flink pull request: [FLINK-2337] Multiple SLF4J bindings using Sto...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/903#issuecomment-120603967

+1 for merging :)
[GitHub] flink pull request: [doc] Fix wordcount example in YARN setup
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/897#issuecomment-120384704

Merging...
[GitHub] flink pull request: [FLINK-2200] Add Flink with Scala 2.11 in Mave...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/885#issuecomment-120554092

If I get one more LGTM, I'll merge this.
[GitHub] flink pull request: [FLINK-2200] Add Flink with Scala 2.11 in Mave...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/885#issuecomment-120411775

Hi, I updated the PR :)
[GitHub] flink pull request: [FLINK-2200] Add Flink with Scala 2.11 in Mave...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/885#discussion_r34067611

--- Diff: docs/apis/programming_guide.md ---
@@ -187,7 +187,17 @@ that creates the type information for Flink operations.
 </div>
 </div>

+### Scala Dependency Versions
+
+Because Scala 2.10 binary is not compatible with Scala 2.11 binary, we provide multiple artifacts
+to support both Scala versions. If you want to run your program on Flink with Scala 2.11, you need
+to add a suffix `_2.11` to all Flink artifact ids in your dependencies. You should be careful with
+this difference of artifact id. All modules with Scala 2.11 have a suffix `_2.11` in artifact id.
+For example, `flink-java` should be changed to `flink-java_2.11` and `flink-clients` should be
+changed to `flink-clients_2.11`.

--- End diff --

@aalexandrov I'm not sure it is good to use the keyword "Scala-dependent". We cross-build all modules, including Java-based ones that are linked with Scala modules. I think that omitting "Scala-dependent" is much better.
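In dependency terms, the suffix change described in the quoted documentation would look like this in a user's pom.xml (version number illustrative):

```xml
<!-- Flink built against Scala 2.10 (default artifact id) -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-clients</artifactId>
  <version>0.10-SNAPSHOT</version>
</dependency>

<!-- Flink built against Scala 2.11: same coordinates, suffixed artifact id -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-clients_2.11</artifactId>
  <version>0.10-SNAPSHOT</version>
</dependency>
```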
[GitHub] flink pull request: [FLINK-1745] [ml] [WIP] Add exact k-nearest-ne...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/696#discussion_r34017778

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/classification/KNN.scala ---
@@ -0,0 +1,204 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.classification
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.DataSetUtils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.Vector
+import org.apache.flink.ml.metrics.distances.{DistanceMetric, EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+ *
+ * This algorithm calculates `k` nearest neighbor points in training set for each points of
+ * testing set.
+ *
+ * @example
+ * {{{
+ *   val trainingDS: DataSet[Vector] = ...
+ *   val testingDS: DataSet[Vector] = ...
+ *
+ *   val knn = KNN()
+ *     .setK(10)
+ *     .setBlocks(5)
+ *     .setDistanceMetric(EuclideanDistanceMetric())
+ *
+ *   knn.fit(trainingDS)
+ *
+ *   val predictionDS: DataSet[(Vector, Array[Vector])] = knn.predict(testingDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[org.apache.flink.ml.classification.KNN.K]]
+ *   Sets the K which is the number of selected points as neighbors. (Default value: '''None''')
+ *
+ * - [[org.apache.flink.ml.classification.KNN.Blocks]]
+ *   Sets the number of blocks into which the input data will be split. This number should be set
+ *   at least to the degree of parallelism. If no value is specified, then the parallelism of the
+ *   input [[DataSet]] is used as the number of blocks. (Default value: '''None''')
+ *
+ * - [[org.apache.flink.ml.classification.KNN.DistanceMetric]]
+ *   Sets the distance metric to calculate distance between two points. If no metric is specified,
+ *   then [[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used. (Default value:
+ *   '''EuclideanDistanceMetric()''')
+ *
+ */
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[Vector]]] = None
+
+  /** Sets K
+    * @param k the number of selected points as neighbors
+    */
+  def setK(k: Int): KNN = {
+    require(k > 1, "K must be positive.")
+    parameters.add(K, k)
+    this
+  }
+
+  /** Sets the distance metric
+    * @param metric the distance metric to calculate distance between two points
+    */
+  def setDistanceMetric(metric: DistanceMetric): KNN = {
+    parameters.add(DistanceMetric, metric)
+    this
+  }
+
+  /** Sets the number of data blocks/partitions
+    * @param n the number of data blocks
+    */
+  def setBlocks(n: Int): KNN = {
+    require(n > 1, "Number of blocks must be positive.")
+    parameters.add(Blocks, n)
+    this
+  }
+}
+
+object KNN {
+
+  case object K extends Parameter[Int] {
+    val defaultValue: Option[Int] = None
+  }
+
+  case object DistanceMetric extends Parameter[DistanceMetric] {
+    val defaultValue: Option[DistanceMetric] = Some(EuclideanDistanceMetric())
+  }
+
+  case object Blocks extends Parameter[Int] {
+    val defaultValue: Option[Int] = None
+  }
+
+  def apply(): KNN = {
+    new KNN()
+  }
+
+  /** [[FitOperation]] which trains a KNN based on the given training data set.
+    * @tparam T Subtype of [[Vector]]
+    */
+  implicit def fitKNN[T <: Vector : TypeInformation] = new FitOperation[KNN, T
[GitHub] flink pull request: [FLINK-1745] [ml] [WIP] Add exact k-nearest-ne...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/696#issuecomment-119163944

@thvasilo Yeah, exact k-NN is not scalable to gigabyte- or terabyte-sized data. If I add an R-Tree to this algorithm, it would be better. But I agree that we need a vote or discussion about this. I am also interested in approximate k-NN, but we should check on the progress with the assigned contributor. :)
[GitHub] flink pull request: [FLINK-1745] [ml] [WIP] Add exact k-nearest-ne...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/696#issuecomment-118792663

@thvasilo Thanks :) I'll update this pull request soon.
[GitHub] flink pull request: [FLINK-1745] [ml] [WIP] Add exact k-nearest-ne...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/696#discussion_r33921518

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/classification/KNN.scala ---
@@ -0,0 +1,204 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.classification
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.DataSetUtils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.Vector
+import org.apache.flink.ml.metrics.distances.{DistanceMetric, EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+ *
+ * This algorithm calculates `k` nearest neighbor points in training set for each points of
+ * testing set.
+ *
+ * @example
+ * {{{
+ *   val trainingDS: DataSet[Vector] = ...
+ *   val testingDS: DataSet[Vector] = ...
+ *
+ *   val knn = KNN()
+ *     .setK(10)
+ *     .setBlocks(5)
+ *     .setDistanceMetric(EuclideanDistanceMetric())
+ *
+ *   knn.fit(trainingDS)
+ *
+ *   val predictionDS: DataSet[(Vector, Array[Vector])] = knn.predict(testingDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[org.apache.flink.ml.classification.KNN.K]]
+ *   Sets the K which is the number of selected points as neighbors. (Default value: '''None''')
+ *
+ * - [[org.apache.flink.ml.classification.KNN.Blocks]]
+ *   Sets the number of blocks into which the input data will be split. This number should be set
+ *   at least to the degree of parallelism. If no value is specified, then the parallelism of the
+ *   input [[DataSet]] is used as the number of blocks. (Default value: '''None''')
+ *
+ * - [[org.apache.flink.ml.classification.KNN.DistanceMetric]]
+ *   Sets the distance metric to calculate distance between two points. If no metric is specified,
+ *   then [[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used. (Default value:
+ *   '''EuclideanDistanceMetric()''')
+ *
+ */
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[Vector]]] = None
+
+  /** Sets K
+    * @param k the number of selected points as neighbors
+    */
+  def setK(k: Int): KNN = {
+    require(k > 1, "K must be positive.")
+    parameters.add(K, k)
+    this
+  }
+
+  /** Sets the distance metric
+    * @param metric the distance metric to calculate distance between two points
+    */
+  def setDistanceMetric(metric: DistanceMetric): KNN = {
+    parameters.add(DistanceMetric, metric)
+    this
+  }
+
+  /** Sets the number of data blocks/partitions
+    * @param n the number of data blocks
+    */
+  def setBlocks(n: Int): KNN = {
+    require(n > 1, "Number of blocks must be positive.")
+    parameters.add(Blocks, n)
+    this
+  }
+}
+
+object KNN {
+
+  case object K extends Parameter[Int] {
+    val defaultValue: Option[Int] = None
+  }
+
+  case object DistanceMetric extends Parameter[DistanceMetric] {
+    val defaultValue: Option[DistanceMetric] = Some(EuclideanDistanceMetric())
+  }
+
+  case object Blocks extends Parameter[Int] {
+    val defaultValue: Option[Int] = None
+  }
+
+  def apply(): KNN = {
+    new KNN()
+  }
+
+  /** [[FitOperation]] which trains a KNN based on the given training data set.
+    * @tparam T Subtype of [[Vector]]
+    */
+  implicit def fitKNN[T <: Vector : TypeInformation] = new FitOperation[KNN, T
[GitHub] flink pull request: [FLINK-2200] Add Flink with Scala 2.11 in Mave...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/885#issuecomment-118611879

@aalexandrov Thank you for the review. :) I will apply your suggestions and update this PR. But with these changes, all modules require the suffix if the module is linked with Scala 2.11, so we don't need a list of modules.
[GitHub] flink pull request: Collect(): Fixing the akka.framesize size limi...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/887#issuecomment-118644497

FLINK-2319: I created a JIRA ticket to track the changes of this PR. :)
[GitHub] flink pull request: [FLINK-2200] Add Flink with Scala 2.11 in Mave...
GitHub user chiwanpark opened a pull request:

https://github.com/apache/flink/pull/885

[FLINK-2200] Add Flink with Scala 2.11 in Maven Repository

Hi, this PR contains the following changes:

* Add a suffix `_2.11` to all maven modules by profile setting. (`-Pscala-2.11`)
* Add documentation about using Flink with Scala 2.11.

I defined rules for adding the suffix `_2.11` to maven modules (from the suggestion and discussion on the mailing list, thanks @rmetzger):

1. All modules with Scala 2.10 don't have any suffix. (`flink-java`, `flink-core`, `flink-scala`, ..., etc.)
2. All modules with Scala 2.11, except those packaged as pom, have the suffix `_2.11`. (`flink-java_2.11`, `flink-core_2.11`, `flink-scala_2.11`, ..., etc.)
   * `flink-parent`, `flink-quickstart`, `flink-examples`, ..., etc. don't have any suffix although they are built with Scala 2.11, because these modules only structure the project.
3. Currently, `flink-scala-shell` doesn't have any suffix, but later we will have to add the suffix to it.

I think that providing quickstart with a Scala version variation such as `flink-quickstart-java`, `flink-quickstart-java_2.11` is good. But I cannot find a way to change the archetype result file (`pom.xml` in `src/main/resources/archetype-resources`) by maven profile. So I added an explanation of the default Scala version in quickstart to the documentation.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chiwanpark/flink FLINK-2200

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/885.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #885

commit e15ad6749c24a7bc6e2d37370c926981066a52e1
Author: Chiwan Park <chiwanp...@apache.org>
Date:   2015-07-02T17:28:51Z

    [FLINK-2200] [build system] Add scala version suffix to artifact id for Scala 2.11

commit 3bfa184af168d538cbafb527b6238ff945997d95
Author: Chiwan Park <chiwanp...@apache.org>
Date:   2015-07-03T09:16:16Z

    [FLINK-2200] [docs] Add documentation for Flink with Scala 2.11
[GitHub] flink pull request: [FLINK-2272] [ml] [docs] Move vision and roadm...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/864#issuecomment-117860517

Good. +1 for merging.
[GitHub] flink pull request: [FLINK-1745] [ml] [WIP] Add exact k-nearest-ne...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/696#issuecomment-117059705

Hi, I updated this PR. I reimplemented the kNN join with `zipWithIndex` and fitted it to the changed pipeline architecture.
[GitHub] flink pull request: [FLINK-2152] Added zipWithIndex
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/832#discussion_r33419462

--- Diff: flink-tests/src/test/scala/org/apache/flink/api/scala/util/DataSetUtilsITCase.scala ---
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.api.scala.util
+
+import org.apache.flink.api.scala._
+import org.apache.flink.test.util.{MultipleProgramsTestBase, TestBaseUtils}
+import org.junit.rules.TemporaryFolder
+import org.junit.runner.RunWith
+import org.junit.runners.Parameterized
+import org.junit.{After, Before, Rule, Test}
+import org.apache.flink.api.scala.DataSetUtils.utilsToDataSet
+
+@RunWith(classOf[Parameterized])
+class DataSetUtilsITCase (mode: MultipleProgramsTestBase.TestExecutionMode) extends
+MultipleProgramsTestBase(mode){
+
+  private var resultPath: String = null
+  private var expectedResult: String = null
+
+  var tempFolder: TemporaryFolder = new TemporaryFolder()
+
+  @Rule
+  def getFolder(): TemporaryFolder = {
+    tempFolder;
+  }
+
+  @Before
+  @throws(classOf[Exception])
+  def before {
+    resultPath = tempFolder.newFile.toURI.toString
+  }
+
+  @Test
+  @throws(classOf[Exception])
+  def testZipWithIndex {
+    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
+    env.setParallelism(1)
+
+    val input: DataSet[String] = env.fromElements("A", "B", "C", "D", "E", "F")
+    val result: DataSet[(Long, String)] = input.zipWithIndex
+
+    result.writeAsCsv(resultPath, "\n", ",")
+    env.execute()
+
+    expectedResult = "0,A\n" + "1,B\n" + "2,C\n" + "3,D\n" + "4,E\n" + "5,F"
+  }
+
+  @After
+  @throws(classOf[Exception])
+  def after {

--- End diff --

Add a parenthesis and return type.
[GitHub] flink pull request: [FLINK-2152] Added zipWithIndex
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/832#issuecomment-116208070

Hi, I added some minor comments about coding style in the Scala test case. The rest looks okay. I think we can merge this after the style is fixed.
[GitHub] flink pull request: [FLINK-2152] Added zipWithIndex
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/832#discussion_r33419461

--- Diff: flink-tests/src/test/scala/org/apache/flink/api/scala/util/DataSetUtilsITCase.scala ---
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.api.scala.util
+
+import org.apache.flink.api.scala._
+import org.apache.flink.test.util.{MultipleProgramsTestBase, TestBaseUtils}
+import org.junit.rules.TemporaryFolder
+import org.junit.runner.RunWith
+import org.junit.runners.Parameterized
+import org.junit.{After, Before, Rule, Test}
+import org.apache.flink.api.scala.DataSetUtils.utilsToDataSet
+
+@RunWith(classOf[Parameterized])
+class DataSetUtilsITCase (mode: MultipleProgramsTestBase.TestExecutionMode) extends
+MultipleProgramsTestBase(mode){
+
+  private var resultPath: String = null
+  private var expectedResult: String = null
+
+  var tempFolder: TemporaryFolder = new TemporaryFolder()
+
+  @Rule
+  def getFolder(): TemporaryFolder = {
+    tempFolder;
+  }
+
+  @Before
+  @throws(classOf[Exception])
+  def before {
+    resultPath = tempFolder.newFile.toURI.toString
+  }
+
+  @Test
+  @throws(classOf[Exception])
+  def testZipWithIndex {

--- End diff --

Add a parenthesis and return type.
[GitHub] flink pull request: [FLINK-2152] Added zipWithIndex
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/832#discussion_r33419459

--- Diff: flink-tests/src/test/scala/org/apache/flink/api/scala/util/DataSetUtilsITCase.scala ---
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.api.scala.util
+
+import org.apache.flink.api.scala._
+import org.apache.flink.test.util.{MultipleProgramsTestBase, TestBaseUtils}
+import org.junit.rules.TemporaryFolder
+import org.junit.runner.RunWith
+import org.junit.runners.Parameterized
+import org.junit.{After, Before, Rule, Test}
+import org.apache.flink.api.scala.DataSetUtils.utilsToDataSet
+
+@RunWith(classOf[Parameterized])
+class DataSetUtilsITCase (mode: MultipleProgramsTestBase.TestExecutionMode) extends
+MultipleProgramsTestBase(mode){
+
+  private var resultPath: String = null
+  private var expectedResult: String = null
+
+  var tempFolder: TemporaryFolder = new TemporaryFolder()
+
+  @Rule
+  def getFolder(): TemporaryFolder = {
+    tempFolder;
+  }
+
+  @Before
+  @throws(classOf[Exception])
+  def before {

--- End diff --

It would be better to insert an empty parenthesis and declare the return type as Unit, to match the coding style of the other test cases. I think `def before(): Unit = {` is better than now.
[GitHub] flink pull request: [FLINK-2152] Added zipWithIndex
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/832#issuecomment-116252016

Looks good :) merging
[GitHub] flink pull request: [FLINK-2152] Added zipWithIndex
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/832#issuecomment-116252346

Oops! I forgot to add "This closes #832" to the commit message. I made the mistake because this is my first commit pushed to the Apache repository. Sorry. How can I fix it?
[GitHub] flink pull request: [FLINK-2278] [ml] changed Vector fromBreeze
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/869#issuecomment-115636110

Nice catch and looks good to merge. :)
[GitHub] flink pull request: [FLINK-1962] Add Gelly Scala API
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/808#issuecomment-113147981

Hi. I'm very excited about Gelly's Scala API, and I'm reading through the changes in this PR. I found a problem with re-formatting: unlike Flink's Java code, Flink's Scala code uses 2 spaces as the indent, but it seems the IDE re-formatted all code to use 4 spaces. We should preserve the previous indent setting to preserve the modification history.
[GitHub] flink pull request: [FLINK-1745] [ml] [WIP] Add exact k-nearest-ne...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/696#issuecomment-112745989 @thvasilo Good! Thank you. I'll update the implementation after #801 is merged.
[GitHub] flink pull request: [FLINK-1745] [ml] [WIP] Add exact k-nearest-ne...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/696#issuecomment-112733734 Hi. I updated the kNN implementation to use the pipeline architecture.
[GitHub] flink pull request: [FLINK-2076] [runtime] Fix memory leakage in M...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/751#issuecomment-107478713 I fixed the bug related to the non-null memory segment. It was caused by the `writeBehindBuffersAvailable` variable not being cleared in the `close()` method of the `MutableHashTable` class, and I added a test case for this bug. I still have trouble creating a test case for the bug related to the null memory segment. I think it is also related to parallelism: if I run the ConnectedComponents example with parallelism 1, the bug does not occur, so I cannot reproduce the state without the ConnectedComponents example.
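An illustrative Scala sketch of the reset described above (the field and method names follow the comment; the actual `MutableHashTable` is Java code, and the variable's type is assumed):

```scala
// If close() does not clear the counter, a re-opened table believes that
// write-behind buffers are still available and mis-accounts its segments.
class ReopenableTableSketch {
  private var writeBehindBuffersAvailable: Int = 0

  def close(): Unit = {
    // ... return all partitions and buffers to the memory pool ...
    writeBehindBuffersAvailable = 0 // the missing reset that caused the leak
  }
}
```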
[GitHub] flink pull request: [FLINK-2076] [runtime] Fix memory leakage in M...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/751#issuecomment-107287836 I was debugging with the ConnectedComponents example and found the bug caused by a null memory segment. A moment ago I tried adding a test case that re-opens the hash table with little memory, but I found another memory leak that is not related to a null memory segment (all memory segments are non-null). I think more investigation is needed; I'll update this PR when I fix the bug.
[GitHub] flink pull request: [FLINK-2076] [runtime] Fix memory leakage in M...
GitHub user chiwanpark opened a pull request: https://github.com/apache/flink/pull/751 [FLINK-2076] [runtime] Fix memory leakage in MutableHashTable

Hi. This PR contains a bug fix for [FLINK-2076](https://issues.apache.org/jira/browse/FLINK-2076). When the `prepareNextPartition` method runs with some pending partitions, a memory leak can occur at the `memory.add(getNextBuffer())` statements (lines 515 and 516). In the common case there is extra memory space for the hash table, so the first and second calls of `getNextBuffer()` both return a memory segment. But in an extreme case, such as the one described in the JIRA issue, the second call of `getNextBuffer()` can return null, and this null causes the memory leak.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chiwanpark/flink FLINK-2076

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/751.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #751

commit e9aa4d9a1ca36178f3f860b1997d5f92ada51c3f
Author: Chiwan Park <chiwanp...@icloud.com>
Date: 2015-05-30T18:27:17Z

    [FLINK-2076] [runtime] Fix memory leakage in MutableHashTable
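To make the failure mode concrete, here is an illustrative Scala sketch of the guarded pattern the fix implies (the real class is Java; `memory` and `getNextBuffer` are stand-ins named after the description above, not the actual Flink source):

```scala
import scala.collection.mutable.ListBuffer

// getNextBuffer() returns null when no write-behind buffer is available.
// Guarding each call keeps the free-segment accounting intact.
def refill(memory: ListBuffer[AnyRef], getNextBuffer: () => AnyRef): Unit = {
  val first = getNextBuffer()
  if (first != null) {
    memory += first      // usually succeeds, since extra space exists
  }
  val second = getNextBuffer()
  if (second != null) {
    memory += second     // previously unguarded: a null return leaked here
  }
}
```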
[GitHub] flink pull request: [FLINK-2061] CSVReader: quotedStringParsing an...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/734#issuecomment-106313923 Okay. :) Because Stephan's email address was in the test code, I modified the test code.
[GitHub] flink pull request: [FLINK-2061] CSVReader: quotedStringParsing an...
GitHub user chiwanpark opened a pull request: https://github.com/apache/flink/pull/734 [FLINK-2061] CSVReader: quotedStringParsing and includeFields yields ParseException

This fixes the bug in `GenericCsvInputFormat` that occurs when a skipped field is a quoted string. I also added a unit test for this case.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chiwanpark/flink FLINK-2061

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/734.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #734

commit 99fb79beda88c73d80e630aa5e22e9ee401538ed
Author: Chiwan Park <chiwanp...@icloud.com>
Date: 2015-05-27T13:24:59Z

    [FLINK-2061] [java api] Fix GenericCsvInputFormat skipping fields error with quoted string
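A hypothetical reproduction sketch of the scenario this fixes (Scala API; the file path and contents are invented, and the parameter names `quoteCharacter`/`includedFields` are assumed from the `readCsvFile` signature of that era):

```scala
import org.apache.flink.api.scala._

// Assumed input line:  "skip me",5,hello   -- field 0 is quoted and skipped.
// Before this fix, skipping a quoted field raised a ParseException.
val env = ExecutionEnvironment.getExecutionEnvironment
val ds = env.readCsvFile[(Int, String)](
  "file:///tmp/input.csv",        // hypothetical path
  quoteCharacter = '"',           // enable quoted-string parsing
  includedFields = Array(1, 2))   // skip field 0, the quoted one
```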
[GitHub] flink pull request: [FLINK-1745] [ml] [WIP] Add exact k-nearest-ne...
GitHub user chiwanpark opened a pull request: https://github.com/apache/flink/pull/696 [FLINK-1745] [ml] [WIP] Add exact k-nearest-neighbours algorithm to machine learning library

This PR is not final; it is a work in progress. You can find a detailed description in [JIRA](https://issues.apache.org/jira/browse/FLINK-1745).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chiwanpark/flink FLINK-1745

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/696.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #696

commit 804abe12c9503265171291a2841332341fe972be
Author: Chiwan Park <chiwanp...@icloud.com>
Date: 2015-05-15T05:45:50Z

    kNN join first draft
[GitHub] flink pull request: [FLINK-2001] [ml] Fix DistanceMetric serializa...
GitHub user chiwanpark opened a pull request: https://github.com/apache/flink/pull/668 [FLINK-2001] [ml] Fix DistanceMetric serialization error

* `DistanceMetric` now extends Serializable
* Add a simple serialization test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chiwanpark/flink FLINK-2001

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/668.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #668

commit 53726c12c8af09dfbe17903df3efcc1308da0540
Author: Chiwan Park <chiwanp...@icloud.com>
Date: 2015-05-12T06:11:25Z

    [FLINK-2001] [ml] Fix DistanceMetric serialization error
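A minimal sketch of the fix named above (the trait shape is assumed from the PR summary):

```scala
import org.apache.flink.ml.math.Vector

// Extending Serializable lets metric instances be shipped inside closures
// to the cluster instead of failing when the job graph is serialized.
trait DistanceMetric extends Serializable {
  def distance(a: Vector, b: Vector): Double
}
```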
[GitHub] flink pull request: [FLINK-1933] Add distance measure interface an...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/629#discussion_r29837576

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/math/metrics/distances/CosineDistanceMeasure.scala ---
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.math.metrics.distances
+
+import org.apache.flink.ml.math.Vector
+
+/** This class implements a cosine distance metric. The class calculates the distance between
+ * the given vectors by dividing the dot product of two vectors by the product of their lengths.
+ * We convert the result of division to a usable distance. So, 1 - cos(angle) is actually returned.
+ *
+ * @see http://en.wikipedia.org/wiki/Cosine_similarity
+ */
+class CosineDistanceMeasure extends DistanceMeasure {
+  override def distance(a: Vector, b: Vector): Double = {
+    checkValidArguments(a, b)
+
+    val dotProd: Double = a.dot(b)
+    val denominator: Double = a.magnitude * b.magnitude
+    if (dotProd == 0 && denominator == 0) {
--- End diff --

@tillrohrmann That is for the case with a zero vector. Without the code handling the zero-vector case, we get a division-by-zero problem there. I followed [the result of Wolfram Alpha](http://goo.gl/NXGLgo) and [the implementation of Mahout](https://github.com/apache/mahout/blob/master/mr/src/main/java/org/apache/mahout/common/distance/CosineDistanceMeasure.java#L66) to resolve this case.
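A small usage sketch of the `1 - cos(angle)` convention described in the doc comment (the `CosineDistanceMetric` class and package names reflect the rename mentioned in the next comment; treat them as assumptions):

```scala
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.metrics.distances.CosineDistanceMetric

// Orthogonal vectors: cos(angle) = 0, so the distance is 1 - 0 = 1.0.
val a = DenseVector(1.0, 0.0)
val b = DenseVector(0.0, 1.0)
val metric = new CosineDistanceMetric
println(metric.distance(a, b)) // expected: 1.0
```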
[GitHub] flink pull request: [FLINK-1933] Add distance measure interface an...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/629#issuecomment-99803730 @tillrohrmann I just renamed the classes. :)
[GitHub] flink pull request: [FLINK-1933] Add distance measure interface an...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/629#issuecomment-99803806 Oh, the commit logs still contain "distance measure". I will fix them.
[GitHub] flink pull request: [FLINK-1933] Add distance measure interface an...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/629#issuecomment-99531887 I added an overview documentation page. Please review it and comment if there are any errors.
[GitHub] flink pull request: [FLINK-1855] SocketTextStreamWordCount example...
GitHub user chiwanpark opened a pull request: https://github.com/apache/flink/pull/644 [FLINK-1855] SocketTextStreamWordCount example cannot be run from the webclient

This PR fixes [FLINK-1855](https://issues.apache.org/jira/browse/FLINK-1855). Tested on a local cluster with Oracle JDK 7 and JDK 8.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chiwanpark/flink FLINK-1855

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/644.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #644

commit 273c1707fa808a8405e3887dfa91a746485b3283
Author: Chiwan Park <chiwanp...@icloud.com>
Date: 2015-05-04T04:45:02Z

    [FLINK-1855] [streaming] Add WordCount class into WindowWordCount and SocketTextStreamWordCount example jars
[GitHub] flink pull request: [FLINK-1937] [ml] Fixes sparse vector/matrix c...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/636#issuecomment-97100649 Looks good to merge. :)
[GitHub] flink pull request: [FLINK-1933] Add distance measure interface an...
GitHub user chiwanpark opened a pull request: https://github.com/apache/flink/pull/629 [FLINK-1933] Add distance measure interface and basic implementation to machine learning library

This PR contains the following changes:

* Add a `dot` method and a `magnitude` method to `Vector` (see the usage sketch below).
* Add the `DistanceMeasure` trait.
* Add 7 basic implementations of `DistanceMeasure`.
* Add tests for the above changes.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chiwanpark/flink FLINK-1933

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/629.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #629

commit 3fac0ff339de93ab1c3b3582924af75a1e6057ea
Author: Chiwan Park <chiwanp...@icloud.com>
Date: 2015-04-24T03:50:28Z

    [FLINK-1933] [ml] Add dot product and magnitude into Vector

commit c8f940c2439f754ef0e640b5440507bce4b859d2
Author: Chiwan Park <chiwanp...@icloud.com>
Date: 2015-04-27T09:48:56Z

    [FLINK-1933] [ml] Add distance measure interface and basic implementation to machine learning library
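A quick usage sketch of the two vector operations added here (the values are illustrative):

```scala
import org.apache.flink.ml.math.DenseVector

val v = DenseVector(3.0, 4.0)
val d = v.dot(DenseVector(1.0, 0.0)) // dot product: 3.0
val m = v.magnitude                  // Euclidean length: 5.0
```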
[GitHub] flink pull request: [FLINK-703] Use complete element as join key
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/572#issuecomment-94512484 I updated this PR. Thanks for the advice. :)

* Remove `setParallelism(1)` from the test code.
* Simplify `testGroupReduceWithAtomicValue` in `GroupReduceITCase`.
* Add a `null` check for `expressionsIn` in the constructor of `ExpressionKeys`.
[GitHub] flink pull request: [FLINK-703] Use complete element as join key
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/572#issuecomment-94188129 Hi. I updated this PR. The changes are the following:

* Re-implement this feature by generalizing `ExpressionKeys`.
* Modify `CoGroupOperatorBase`, `GroupCombineOperatorBase` and `GroupReduceOperatorBase` to allow the use of an AtomicType as a key.
* Add unit tests for invalid usage.

Because the documentation already mentions the wildcard expression for atomic types, I didn't modify the documentation.
[GitHub] flink pull request: [FLINK-1906] [docs] Add tip to work around pla...
GitHub user chiwanpark opened a pull request: https://github.com/apache/flink/pull/611 [FLINK-1906] [docs] Add tip to work around plain Tuple return type of project operator

This adds a tip about using the project transformation with a type hint to the documentation. The related JIRA issue is [here](https://issues.apache.org/jira/browse/FLINK-1906).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chiwanpark/flink FLINK-1906

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/611.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #611

commit fd784be28c8f6bd85019e653a131975c36e7f2d0
Author: Chiwan Park <chiwanp...@icloud.com>
Date: 2015-04-18T18:24:16Z

    [FLINK-1906] [docs] Add tip to work around plain Tuple return type of project operator
[GitHub] flink pull request: [FLINK-703] Use complete element as join key
GitHub user chiwanpark opened a pull request: https://github.com/apache/flink/pull/572 [FLINK-703] Use complete element as join key

Hello. I am opening a pull request for FLINK-703; you can find a more detailed description in [JIRA](https://issues.apache.org/jira/browse/FLINK-703). This PR contains the following changes:

* Add a `BasicKeySelector` class to use the complete element as the key.
* Add a `checkForAtomicType` method to the `Keys` class to check this condition.
* Modify `CoGroupOperator`, `JoinOperator` and `DataSet` in the Java and Scala APIs.
* Add unit tests and integration tests for this modification.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chiwanpark/flink FLINK-703

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/572.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #572

commit e6804bbf4d5ec345a723d692da26b277a1cabf04
Author: Chiwan Park <chiwanp...@icloud.com>
Date: 2015-04-05T18:18:23Z

    [FLINK-703] [java api] Use complete element as join key

commit 862c5a5608cd7fd8af22c5a781e87ef0ece79e85
Author: Chiwan Park <chiwanp...@icloud.com>
Date: 2015-04-05T20:07:11Z

    [FLINK-703] [scala api] Use complete element as join key
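A hypothetical usage sketch of the feature (the `"*"` wildcard expression for atomic types is the documented entry point mentioned in the update comment above):

```scala
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val left = env.fromElements(1, 2, 3)
val right = env.fromElements(2, 3, 4)

// Join on the complete element: the wildcard selects the value itself.
val joined = left.join(right).where("*").equalTo("*")
```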
[GitHub] flink pull request: [FLINK-1512] Add CsvReader for reading into PO...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/426#issuecomment-86011099 Oops, I pushed an intermediate commit a8a5c37. I will fix it.
[GitHub] flink pull request: [FLINK-1512] Add CsvReader for reading into PO...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/426#issuecomment-86011782 @fhueske You can check it now :)
[GitHub] flink pull request: [FLINK-1512] Add CsvReader for reading into PO...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/426#issuecomment-85830275 Hi, I updated this PR.

* Remove the `pojoType(Class<?> targetType)` method from `CsvReader` to force the user to explicitly specify the field order.
* Add a check for the existence of the field order to the `readCsvFile` method.
* Add two integration tests for the above two modifications.

By the way, I cannot find out why Travis fails. On my machine, `mvn clean install -DskipTests` and `mvn verify` succeed. From the [travis log](https://travis-ci.org/chiwanpark/flink/jobs/55747686#L7074), it seems that the problem is related to Gelly, but although I read some of the Gelly code, I could not find the exact problem. Could anyone help me with this?
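A hypothetical sketch of the explicit-field-order usage this change enforces (Scala API; the POJO class and file path are invented, and `pojoFields` is assumed as the parameter name):

```scala
import org.apache.flink.api.scala._

// Invented POJO: a no-arg constructor and public var fields.
class Person {
  var name: String = _
  var age: Int = _
}

object ReadPojoExample {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val people = env.readCsvFile[Person](
      "file:///tmp/people.csv",          // hypothetical path
      pojoFields = Array("name", "age")) // explicit field order, now required
    people.print()
  }
}
```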
[GitHub] flink pull request: [FLINK-1512] Add CsvReader for reading into PO...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/426#issuecomment-85285193 @fhueske Hi, thanks for your kind advice! I will fix these soon. About the order of the POJO fields, I also think that option 3 is good. However, [FLINK-1665](https://issues.apache.org/jira/browse/FLINK-1665) is not implemented yet, so I will implement options 1 and 2 now; after FLINK-1665 is completed, we can implement option 3. About the inheritance of `ScalaCsvInputFormat`: I didn't think about the Record API. Your suggestion looks good, so I will revert the changes and refactor `GenericCsvInputFormat` to contain the duplicated methods.
[GitHub] flink pull request: [FLINK-1512] Add CsvReader for reading into PO...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/426#issuecomment-82885499 I updated this PR.

* Change the way the `Field` object is obtained, from using `PojoTypeInfo` to saving the field names (thanks @fhueske for the advice!).
* Make `ScalaCsvInputFormat` extend `CsvInputFormat`, because there is a lot of duplicated code between the two classes.
* Add integration tests for `CsvReader` (Java API) and `ExecutionEnvironment.readCsvFile` (Scala API).

Any feedback is welcome, especially on the error messages, because of my poor English.
[GitHub] flink pull request: [FLINK-1512] Add CsvReader for reading into PO...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/426#issuecomment-78876571 Hello. I have a question about object reuse in the `readRecord` method of `ScalaCsvInputFormat`. In the Java implementation, `CsvInputFormat` reuses the result object, but `ScalaCsvInputFormat` does not reuse the object and creates a new instance for each record. Why doesn't `ScalaCsvInputFormat` reuse the object?
[GitHub] flink pull request: [FLINK-1512] Add CsvReader for reading into PO...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/426#issuecomment-79066240 @aljoscha Thanks! Now I understand it.
[GitHub] flink pull request: [FLINK-1654] Wrong scala example of POJO type ...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/478#issuecomment-78228121 As I wrote in JIRA, the fields of a Scala class (not a case class) are private by default ([reference](http://stackoverflow.com/questions/1589603/scala-set-a-field-value-reflectively-from-field-name)). Because fields declared with the `val` keyword don't have a setter, Flink's TypeExtractor fails to extract the information from the POJO example in the documentation. I attached the result of a sample type extraction in [JIRA](https://issues.apache.org/jira/browse/FLINK-1654?focusedCommentId=14348124&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14348124). The TypeExtractor handles Scala case classes differently: case classes are treated like Scala tuples.
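An illustrative contrast of the two cases described above (the class names are invented; the plain-class variant needs `var` fields and a no-arg constructor so that setters exist):

```scala
// Plain class: var fields generate setters, and the auxiliary no-arg
// constructor makes the class analyzable as a POJO.
class WordWithCount(var word: String, var count: Int) {
  def this() = this(null, 0)
}

// Case class: analyzed like a Scala tuple, so immutable fields are fine.
case class WordWithCountCC(word: String, count: Int)
```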
[GitHub] flink pull request: [FLINK-1512] Add CsvReader for reading into PO...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/426#issuecomment-78401359 @fhueske Thanks for your kind advice. I will fix it as soon as possible.
[GitHub] flink pull request: [FLINK-1654] Wrong scala example of POJO type ...
GitHub user chiwanpark opened a pull request: https://github.com/apache/flink/pull/478 [FLINK-1654] Wrong scala example of POJO type in documentation

A more detailed description and discussion can be found in [JIRA](https://issues.apache.org/jira/browse/FLINK-1654).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chiwanpark/flink FLINK-1654

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/478.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #478

commit 398f77facd5aeb357a3e5e7825da60eca65e9435
Author: Chiwan Park <chiwanp...@icloud.com>
Date: 2015-03-11T01:40:17Z

    [FLINK-1654] [docs] Fix scala example in programming guide
[GitHub] flink pull request: [FLINK-1512] Add CsvReader for reading into PO...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/426#issuecomment-77293135 @fhueske Oh, you are right. Currently, users cannot decide the order of the fields. I will add a parameter for setting the field order.
[GitHub] flink pull request: [FLINK-1512] Add CsvReader for reading into PO...
Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/426#discussion_r25760453

--- Diff: flink-java/src/main/java/org/apache/flink/api/java/io/CsvInputFormat.java ---
@@ -235,8 +252,21 @@ public OUT readRecord(OUT reuse, byte[] bytes, int offset, int numBytes) throws
 	if (parseRecord(parsedValues, bytes, offset, numBytes)) {
 		// valid parse, map values into pact record
-		for (int i = 0; i < parsedValues.length; i++) {
-			reuse.setField(parsedValues[i], i);
+		if (pojoTypeInfo == null) {
+			Tuple result = (Tuple) reuse;
+			for (int i = 0; i < parsedValues.length; i++) {
+				result.setField(parsedValues[i], i);
+			}
+		} else {
+			for (int i = 0; i < parsedValues.length; i++) {
+				try {
+					pojoTypeInfo.getPojoFieldAt(i).field.set(reuse, parsedValues[i]);
--- End diff --

@rmetzger Thanks! I modified my implementation to set the fields accessible in `CsvInputFormat` and `ScalaCsvInputFormat`, and added a test case with private fields.
[GitHub] flink pull request: [FLINK-1512] Add CsvReader for reading into PO...
GitHub user chiwanpark opened a pull request: https://github.com/apache/flink/pull/426 [FLINK-1512] Add CsvReader for reading into POJOs.

This PR contains the following changes:

* `CsvInputFormat` and `ScalaCsvInputFormat` can receive a POJO type as their generic parameter.
* Add `pojoType(Class<T> targetType)` to `CsvReader` (Java API).
* Modify the `readCsvFile` method in `ExecutionEnvironment` (Scala API).
* Add unit tests for `CsvInputFormat` and `ScalaCsvInputFormat`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chiwanpark/flink FLINK-1512

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/426.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #426

commit 2463d6b7d244528e1625288e1b780335769f14ee
Author: Chiwan Park <chiwanp...@icloud.com>
Date: 2015-02-18T18:27:59Z

    [FLINK-1512] [java api] Add CsvReader for reading into POJOs

commit 8fe5f8d1bd402382e6fa93014c5b2fec8e22cbd0
Author: Chiwan Park <chiwanp...@icloud.com>
Date: 2015-02-19T17:23:56Z

    [FLINK-1512] [scala api] Add CsvReader for reading into POJOs
[GitHub] flink pull request: [FLINK-1179] Add button to JobManager web inte...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/374#issuecomment-73509674 @StephanEwen Thanks for your advice! I fixed it.
[GitHub] flink pull request: [FLINK-1179] Add button to JobManager web inte...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/374#issuecomment-73410413 @tillrohrmann Thanks for your advice. I will fix it!
[GitHub] flink pull request: [FLINK-1179] Add button to JobManager web inte...
GitHub user chiwanpark opened a pull request: https://github.com/apache/flink/pull/374 [FLINK-1179] Add button to JobManager web interface to request stack trace of a TaskManager

This PR contains the following changes:

* Add public constructors to `org.apache.flink.runtime.instance.InstanceID` for sending an instance ID from the web interface to the job manager.
* Add a helper method called `getRegisteredInstanceById(InstanceID)` to `org.apache.flink.runtime.instance.InstanceManager` for finding the Akka actor for an instance ID.
* Add Akka messages called `RequestStackTrace`, `SendStackTrace` and `StackTrace`.
* Modify the task manager page in the web interface of the job manager to request and show the stack trace of a task manager.

The following image is a screenshot of the job manager web interface. ![screen shot 2015-02-08 at 3 49 51 pm](https://cloud.githubusercontent.com/assets/1941681/6095765/9293e996-afaf-11e4-9e8e-4dcd69ce595b.png)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chiwanpark/flink FLINK-1179

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/374.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #374

commit 83ab236ce52dd8c3b0aa0b94ee8644a4de28e152
Author: Chiwan Park <chiwanp...@icloud.com>
Date: 2015-02-08T06:36:19Z

    [FLINK-1179] [runtime] Add helper method for InstanceID

commit a2a0a0f8261851c93330417f3c60a16c5f1d2dd5
Author: Chiwan Park <chiwanp...@icloud.com>
Date: 2015-02-08T07:05:29Z

    [FLINK-1179] Add internal API for obtaining StackTrace

commit 423e64ca4cc6ad4a9396e4418eab95ce5b81b219
Author: Chiwan Park <chiwanp...@icloud.com>
Date: 2015-02-08T07:10:03Z

    [FLINK-1179] [jobmanager] Add button to JobManager web interface to request stack trace of a TaskManager
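A minimal sketch of what the three messages named above might look like (the payload shapes are assumptions, not the actual Flink definitions):

```scala
import org.apache.flink.runtime.instance.InstanceID

// Web frontend -> JobManager: ask for the stack trace of one TaskManager.
case class RequestStackTrace(instanceID: InstanceID)

// JobManager -> TaskManager: forward the request to the selected instance.
case object SendStackTrace

// TaskManager -> JobManager: the formatted stack trace of all threads.
case class StackTrace(instanceID: InstanceID, stackTrace: String)
```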