[GitHub] flink pull request #2071: [FLINK-4018](streaming-connectors) Configurable id...
GitHub user tzulitai opened a pull request: https://github.com/apache/flink/pull/2071 [FLINK-4018](streaming-connectors) Configurable idle time between getRecords requests to Kinesis shards Along with this new configuration and the already existing `KinesisConfigConstants.CONFIG_SHARD_RECORDS_PER_GET`, users will have more control over the desired throughput behaviour of the Kinesis consumer. The default value for this new configuration is 500 milliseconds of idle time. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tzulitai/flink FLINK-4018 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2071.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2071 commit 74fcb62f62007c0396d2ae870298f35ae5ce Author: Gordon Tai, Date: 2016-06-04T04:43:30Z [FLINK-4018] Add configuration for idle time between get requests to Kinesis shards --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
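For intuition on why a 500 ms idle time helps against the 5-GET-per-shard-per-second limit mentioned in the thread, here is a small back-of-the-envelope sketch. This is plain Java, not the Flink Kinesis connector API; the method name is illustrative only:

```java
public class KinesisIdleTimeSketch {
    // Upper bound on getRecords() calls per shard per second when the consumer
    // sleeps idleMillis between consecutive calls. Ignoring the latency of the
    // call itself is conservative: real latency only lowers the rate further.
    static double maxGetsPerSecond(long idleMillis) {
        return 1000.0 / idleMillis;
    }

    public static void main(String[] args) {
        // Default 500 ms idle time => at most 2 calls/sec per shard,
        // comfortably under Amazon's limit of 5 GET requests/shard/second.
        System.out.println(maxGetsPerSecond(500));
    }
}
```

Combined with `CONFIG_SHARD_RECORDS_PER_GET`, this gives an upper bound on per-shard consumption rate of roughly `maxGetsPerSecond(idle) * recordsPerGet` records per second.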
[jira] [Commented] (FLINK-4018) Configurable idle time between getRecords requests to Kinesis shards
[ https://issues.apache.org/jira/browse/FLINK-4018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315316#comment-15315316 ] ASF GitHub Bot commented on FLINK-4018: --- GitHub user tzulitai opened a pull request: https://github.com/apache/flink/pull/2071 [FLINK-4018](streaming-connectors) Configurable idle time between getRecords requests to Kinesis shards Along with this new configuration and the already existing `KinesisConfigConstants.CONFIG_SHARD_RECORDS_PER_GET`, users will have more control over the desired throughput behaviour of the Kinesis consumer. The default value for this new configuration is 500 milliseconds of idle time. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tzulitai/flink FLINK-4018 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2071.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2071 commit 74fcb62f62007c0396d2ae870298f35ae5ce Author: Gordon Tai, Date: 2016-06-04T04:43:30Z [FLINK-4018] Add configuration for idle time between get requests to Kinesis shards > Configurable idle time between getRecords requests to Kinesis shards > > > Key: FLINK-4018 > URL: https://issues.apache.org/jira/browse/FLINK-4018 > Project: Flink > Issue Type: Sub-task > Components: Streaming Connectors >Affects Versions: 1.1.0 >Reporter: Tzu-Li (Gordon) Tai >Assignee: Tzu-Li (Gordon) Tai > > Currently, the Kinesis consumer calls getRecords() right after finishing > the previous call. This results in easily reaching Amazon's limit of 5 GET > requests per shard per second. Although the code already has a backoff & retry > mechanism, this will affect other applications consuming from the same > Kinesis stream. 
> Along with this new configuration and the already existing > `KinesisConfigConstants.CONFIG_SHARD_RECORDS_PER_GET`, users will have more > control over the desired throughput behaviour of the Kinesis consumer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-4018) Configurable idle time between getRecords requests to Kinesis shards
[ https://issues.apache.org/jira/browse/FLINK-4018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315290#comment-15315290 ] Tzu-Li (Gordon) Tai commented on FLINK-4018: We have a user who has tried out the connector and is preparing to go to production with Flink & Kinesis next week. The only remaining concern is this issue. I'll open a PR within the next 24 hours for review. > Configurable idle time between getRecords requests to Kinesis shards > > > Key: FLINK-4018 > URL: https://issues.apache.org/jira/browse/FLINK-4018 > Project: Flink > Issue Type: Sub-task > Components: Streaming Connectors >Affects Versions: 1.1.0 >Reporter: Tzu-Li (Gordon) Tai >Assignee: Tzu-Li (Gordon) Tai > > Currently, the Kinesis consumer calls getRecords() right after finishing > the previous call. This results in easily reaching Amazon's limit of 5 GET > requests per shard per second. Although the code already has a backoff & retry > mechanism, this will affect other applications consuming from the same > Kinesis stream. > Along with this new configuration and the already existing > `KinesisConfigConstants.CONFIG_SHARD_RECORDS_PER_GET`, users will have more > control over the desired throughput behaviour of the Kinesis consumer.
[jira] [Created] (FLINK-4018) Configurable idle time between getRecords requests to Kinesis shards
Tzu-Li (Gordon) Tai created FLINK-4018: -- Summary: Configurable idle time between getRecords requests to Kinesis shards Key: FLINK-4018 URL: https://issues.apache.org/jira/browse/FLINK-4018 Project: Flink Issue Type: Sub-task Components: Streaming Connectors Affects Versions: 1.1.0 Reporter: Tzu-Li (Gordon) Tai Assignee: Tzu-Li (Gordon) Tai Currently, the Kinesis consumer calls getRecords() right after finishing the previous call. This results in easily reaching Amazon's limit of 5 GET requests per shard per second. Although the code already has a backoff & retry mechanism, this will affect other applications consuming from the same Kinesis stream. Along with this new configuration and the already existing `KinesisConfigConstants.CONFIG_SHARD_RECORDS_PER_GET`, users will have more control over the desired throughput behaviour of the Kinesis consumer.
[jira] [Commented] (FLINK-1979) Implement Loss Functions
[ https://issues.apache.org/jira/browse/FLINK-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315212#comment-15315212 ] ASF GitHub Bot commented on FLINK-1979: --- Github user skavulya commented on the issue: https://github.com/apache/flink/pull/1985 @chiwanpark Decoupling the gradient descent step is complicated for L1 regularization because we are using the proximal gradient method that applies soft thresholding after executing the gradient descent step. I left the regularization penalty as-is. I am thinking of adding an additional method that adds the regularization penalty to gradient without the gradient descent step but I will do it in the L-BFGS PR instead. > Implement Loss Functions > > > Key: FLINK-1979 > URL: https://issues.apache.org/jira/browse/FLINK-1979 > Project: Flink > Issue Type: Improvement > Components: Machine Learning Library >Reporter: Johannes Günther >Assignee: Johannes Günther >Priority: Minor > Labels: ML > > For convex optimization problems, optimizer methods like SGD rely on a > pluggable implementation of a loss function and its first derivative.
[GitHub] flink issue #1985: [FLINK-1979] Add logistic loss, hinge loss and regulariza...
Github user skavulya commented on the issue: https://github.com/apache/flink/pull/1985 @chiwanpark Decoupling the gradient descent step is complicated for L1 regularization because we are using the proximal gradient method that applies soft thresholding after executing the gradient descent step. I left the regularization penalty as-is. I am thinking of adding an additional method that adds the regularization penalty to gradient without the gradient descent step but I will do it in the L-BFGS PR instead.
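The proximal-gradient scheme described in the comment above can be sketched in a few lines. This is an illustrative standalone example, not the flink-ml API; the names `eta` (step size) and `lambda` (L1 strength) are assumptions for the sketch:

```java
public class ProximalGradientSketch {
    // Soft-thresholding: the proximal operator of the L1 penalty. It is
    // applied elementwise AFTER the plain gradient step, which is why the
    // two steps are hard to decouple, as noted in the comment above.
    static double softThreshold(double w, double shrinkage) {
        return Math.signum(w) * Math.max(Math.abs(w) - shrinkage, 0.0);
    }

    // One proximal gradient-descent update: ordinary gradient step, then
    // shrink each coordinate toward zero by eta * lambda.
    static double[] step(double[] w, double[] grad, double eta, double lambda) {
        double[] next = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            double updated = w[i] - eta * grad[i];          // gradient step
            next[i] = softThreshold(updated, eta * lambda); // L1 proximal step
        }
        return next;
    }
}
```

Note how coordinates whose post-gradient magnitude falls below `eta * lambda` are set exactly to zero, which is what makes L1 regularization produce sparse weights.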
[jira] [Commented] (FLINK-3872) Add Kafka TableSource with JSON serialization
[ https://issues.apache.org/jira/browse/FLINK-3872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314656#comment-15314656 ] ASF GitHub Bot commented on FLINK-3872: --- Github user rmetzger commented on the issue: https://github.com/apache/flink/pull/2069 Quickly checked the maven and kafka stuff. Looks good. Very clean, well documented and tested code. Further review for the table api specific parts are still needed. > Add Kafka TableSource with JSON serialization > - > > Key: FLINK-3872 > URL: https://issues.apache.org/jira/browse/FLINK-3872 > Project: Flink > Issue Type: New Feature > Components: Table API >Reporter: Fabian Hueske >Assignee: Ufuk Celebi > Fix For: 1.1.0 > > > Add a Kafka TableSource which reads JSON serialized data.
[GitHub] flink issue #2069: [FLINK-3872] [table, connector-kafka] Add KafkaJsonTableS...
Github user rmetzger commented on the issue: https://github.com/apache/flink/pull/2069 Quickly checked the maven and kafka stuff. Looks good. Very clean, well documented and tested code. Further review for the table api specific parts are still needed.
[jira] [Commented] (FLINK-3850) Add forward field annotations to DataSet operators generated by the Table API
[ https://issues.apache.org/jira/browse/FLINK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314630#comment-15314630 ] ramkrishna.s.vasudevan commented on FLINK-3850: --- [~fhueske] Is it possible that I could work on this issue, with your guidance and feedback? Let me know what you think. > Add forward field annotations to DataSet operators generated by the Table API > - > > Key: FLINK-3850 > URL: https://issues.apache.org/jira/browse/FLINK-3850 > Project: Flink > Issue Type: Improvement > Components: Table API >Reporter: Fabian Hueske > > The DataSet API features semantic annotations [1] to hint the optimizer which > input fields an operator copies. This information is valuable for the > optimizer because it can infer that certain physical properties such as > partitioning or sorting are not destroyed by user functions and thus generate > more efficient execution plans. > The Table API is built on top of the DataSet API and generates DataSet > programs and code for user-defined functions. Hence, it knows exactly which > fields are modified and which not. We should use this information to > automatically generate forward field annotations and attach them to the > operators. This can help to significantly improve the performance of certain > jobs. > [1] > https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/index.html#semantic-annotations
[GitHub] flink issue #1856: FLINK-3650 Add maxBy/minBy to Scala DataSet API
Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/1856 Thanks @zentol . I pushed a new one after wrapping the longer lines.
[jira] [Commented] (FLINK-3650) Add maxBy/minBy to Scala DataSet API
[ https://issues.apache.org/jira/browse/FLINK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314611#comment-15314611 ] ASF GitHub Bot commented on FLINK-3650: --- Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/1856 Thanks @zentol . I pushed a new one after wrapping the longer lines. > Add maxBy/minBy to Scala DataSet API > > > Key: FLINK-3650 > URL: https://issues.apache.org/jira/browse/FLINK-3650 > Project: Flink > Issue Type: Improvement > Components: Java API, Scala API >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: ramkrishna.s.vasudevan > > The stable Java DataSet API contains the API calls {{maxBy}} and {{minBy}}. > These methods are not supported by the Scala DataSet API. These methods > should be added in order to have a consistent API.
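For readers unfamiliar with the operation being ported, the maxBy semantics (select the element whose given field is maximal) can be sketched in plain Java. Names and the tie-breaking choice here are illustrative, not the actual DataSet API:

```java
import java.util.List;

public class MaxBySketch {
    // Returns the tuple (modeled as an int[]) whose field at index `field`
    // is maximal. On ties the earlier element wins, mirroring a
    // left-to-right reduce over the data set.
    static int[] maxBy(List<int[]> data, int field) {
        int[] best = data.get(0);
        for (int[] t : data) {
            if (t[field] > best[field]) {
                best = t;
            }
        }
        return best;
    }
}
```

minBy is the same with the comparison reversed; the issue is about exposing both on the Scala DataSet API with the same semantics as the Java API.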
[jira] [Commented] (FLINK-4016) FoldApplyWindowFunction is not properly initialized
[ https://issues.apache.org/jira/browse/FLINK-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314326#comment-15314326 ] ASF GitHub Bot commented on FLINK-4016: --- GitHub user rvdwenden opened a pull request: https://github.com/apache/flink/pull/2070 [FLINK-4016] initialize FoldApplyWindowFunction properly You can merge this pull request into a Git repository by running: $ git pull https://github.com/rvdwenden/flink master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2070.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2070 commit ec5eb4890201d6d0c9d91510f3b078a681d5a317 Author: rvdwenden, Date: 2016-06-03T15:49:03Z [FLINK-4016] initialize FoldApplyWindowFunction properly > FoldApplyWindowFunction is not properly initialized > --- > > Key: FLINK-4016 > URL: https://issues.apache.org/jira/browse/FLINK-4016 > Project: Flink > Issue Type: Bug > Components: DataStream API >Affects Versions: 1.0.3 >Reporter: RWenden >Priority: Blocker > Labels: easyfix > Fix For: 1.1.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > FoldApplyWindowFunction's output type is not set. > We're using constructions like (excerpt): > .keyBy(0) > .countWindow(10, 5) > .fold(...) > Running this stream gives a runtime exception in FoldApplyWindowFunction: > "No initial value was serialized for the fold window function. Probably the > setOutputType method was not called." 
> This can be easily fixed in WindowedStream.java (around line #449):
>
>     FoldApplyWindowFunction foldApplyWindowFunction =
>         new FoldApplyWindowFunction<>(initialValue, foldFunction, function);
>     foldApplyWindowFunction.setOutputType(resultType, input.getExecutionConfig());
>     operator = new EvictingWindowOperator<>(windowAssigner,
>         windowAssigner.getWindowSerializer(getExecutionEnvironment().getConfig()),
>         keySel,
>         input.getKeyType().createSerializer(getExecutionEnvironment().getConfig()),
>         stateDesc,
>         new InternalIterableWindowFunction<>(foldApplyWindowFunction),
>         trigger,
>         evictor);
[GitHub] flink pull request #2070: [FLINK-4016] initialize FoldApplyWindowFunction pr...
GitHub user rvdwenden opened a pull request: https://github.com/apache/flink/pull/2070 [FLINK-4016] initialize FoldApplyWindowFunction properly You can merge this pull request into a Git repository by running: $ git pull https://github.com/rvdwenden/flink master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2070.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2070 commit ec5eb4890201d6d0c9d91510f3b078a681d5a317 Author: rvdwenden, Date: 2016-06-03T15:49:03Z [FLINK-4016] initialize FoldApplyWindowFunction properly
[jira] [Created] (FLINK-4017) [py] Add Aggregation support to Python API
Geoffrey Mon created FLINK-4017: --- Summary: [py] Add Aggregation support to Python API Key: FLINK-4017 URL: https://issues.apache.org/jira/browse/FLINK-4017 Project: Flink Issue Type: Improvement Components: Python API Reporter: Geoffrey Mon Priority: Minor Aggregations are not currently supported in the Python API. I was getting started with setting up and working with Flink, and figured this would be a relatively simple first task. Currently working on this at https://github.com/geofbot/flink
[jira] [Commented] (FLINK-3872) Add Kafka TableSource with JSON serialization
[ https://issues.apache.org/jira/browse/FLINK-3872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314238#comment-15314238 ] ASF GitHub Bot commented on FLINK-3872: --- GitHub user uce opened a pull request: https://github.com/apache/flink/pull/2069 [FLINK-3872] [table, connector-kafka] Add KafkaJsonTableSource Adds `StreamTableSource` variants for Kafka with syntactic sugar for parsing JSON streams.

```java
KafkaJsonTableSource source = new Kafka08JsonTableSource(
    topic,
    props,
    new String[] { "id" },       // field names
    new Class[] { Long.class }); // field types

tableEnvironment.registerTableSource("kafka-stream", source);
```

You can then continue to work with the stream:

```java
Table result = tableEnvironment.ingest("kafka-stream").filter("id > 1000");
tableEnvironment.toDataStream(result, Row.class).print();
```

**Limitations** - Assumes flat JSON field access (we can easily extend this to use JSON pointers, allowing us to parse nested fields like `/location/area` as field names). - This does not extract any timestamps or watermarks (not an issue right now as the Table API currently does not support operations where this is needed). - The API is kind of cumbersome and non-Scalaesque for the Scala Table API. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uce/flink 3872-kafkajson_table Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2069.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2069 commit 12ec6a594d23bd36bed1e07eeaba2aa75a768f67 Author: Ufuk Celebi, Date: 2016-06-02T20:38:23Z [FLINK-3872] [table, connector-kafka] Add JsonRowDeserializationSchema - Adds a deserialization schema from byte[] to Row to be used in conjunction with the Table API. 
commit a8dc3aa7ab70a91b12af2adccbbed821bf25ecc9 Author: Ufuk Celebi Date: 2016-06-03T13:24:22Z [FLINK-3872] [table, connector-kafka] Add KafkaTableSource > Add Kafka TableSource with JSON serialization > - > > Key: FLINK-3872 > URL: https://issues.apache.org/jira/browse/FLINK-3872 > Project: Flink > Issue Type: New Feature > Components: Table API >Reporter: Fabian Hueske >Assignee: Ufuk Celebi > Fix For: 1.1.0 > > > Add a Kafka TableSource which reads JSON serialized data.
[GitHub] flink pull request #2069: [FLINK-3872] [table, connector-kafka] Add KafkaJso...
GitHub user uce opened a pull request: https://github.com/apache/flink/pull/2069 [FLINK-3872] [table, connector-kafka] Add KafkaJsonTableSource Adds `StreamTableSource` variants for Kafka with syntactic sugar for parsing JSON streams.

```java
KafkaJsonTableSource source = new Kafka08JsonTableSource(
    topic,
    props,
    new String[] { "id" },       // field names
    new Class[] { Long.class }); // field types

tableEnvironment.registerTableSource("kafka-stream", source);
```

You can then continue to work with the stream:

```java
Table result = tableEnvironment.ingest("kafka-stream").filter("id > 1000");
tableEnvironment.toDataStream(result, Row.class).print();
```

**Limitations** - Assumes flat JSON field access (we can easily extend this to use JSON pointers, allowing us to parse nested fields like `/location/area` as field names). - This does not extract any timestamps or watermarks (not an issue right now as the Table API currently does not support operations where this is needed). - The API is kind of cumbersome and non-Scalaesque for the Scala Table API. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uce/flink 3872-kafkajson_table Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2069.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2069 commit 12ec6a594d23bd36bed1e07eeaba2aa75a768f67 Author: Ufuk Celebi, Date: 2016-06-02T20:38:23Z [FLINK-3872] [table, connector-kafka] Add JsonRowDeserializationSchema - Adds a deserialization schema from byte[] to Row to be used in conjunction with the Table API. commit a8dc3aa7ab70a91b12af2adccbbed821bf25ecc9 Author: Ufuk Celebi, Date: 2016-06-03T13:24:22Z [FLINK-3872] [table, connector-kafka] Add KafkaTableSource
[jira] [Created] (FLINK-4016) FoldApplyWindowFunction is not properly initialized
RWenden created FLINK-4016: -- Summary: FoldApplyWindowFunction is not properly initialized Key: FLINK-4016 URL: https://issues.apache.org/jira/browse/FLINK-4016 Project: Flink Issue Type: Bug Components: DataStream API Affects Versions: 1.0.3 Reporter: RWenden Priority: Blocker Fix For: 1.1.0 FoldApplyWindowFunction's output type is not set. We're using constructions like (excerpt):

    .keyBy(0)
    .countWindow(10, 5)
    .fold(...)

Running this stream gives a runtime exception in FoldApplyWindowFunction: "No initial value was serialized for the fold window function. Probably the setOutputType method was not called." This can be easily fixed in WindowedStream.java (around line #449):

    FoldApplyWindowFunction foldApplyWindowFunction =
        new FoldApplyWindowFunction<>(initialValue, foldFunction, function);
    foldApplyWindowFunction.setOutputType(resultType, input.getExecutionConfig());
    operator = new EvictingWindowOperator<>(windowAssigner,
        windowAssigner.getWindowSerializer(getExecutionEnvironment().getConfig()),
        keySel,
        input.getKeyType().createSerializer(getExecutionEnvironment().getConfig()),
        stateDesc,
        new InternalIterableWindowFunction<>(foldApplyWindowFunction),
        trigger,
        evictor);
[GitHub] flink issue #2002: Support for bz2 compression in flink-core
Github user greghogan commented on the issue: https://github.com/apache/flink/pull/2002 This looks very handy. We should also update the formats table in the documentation. https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#read-compressed-files
[GitHub] flink pull request #2066: Updated ssh configuration in base Dockerfile
Github user greghogan commented on a diff in the pull request: https://github.com/apache/flink/pull/2066#discussion_r65717466

    --- Diff: flink-contrib/docker-flink/base/Dockerfile ---
    @@ -38,12 +38,12 @@ ENV JAVA_HOME /usr/java/default/
     RUN echo 'root:secret' | chpasswd
     #SSH as root... probably needs to be revised for security!
    -RUN sed -i 's/PermitRootLogin without-password/PermitRootLogin yes/' /etc/ssh/sshd_config

--- End diff -- Can this be handled more flexibly using sed regular expressions such that any parameter is changed to `yes`?
[jira] [Commented] (FLINK-3921) StringParser not specifying encoding to use
[ https://issues.apache.org/jira/browse/FLINK-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314172#comment-15314172 ] ASF GitHub Bot commented on FLINK-3921: --- Github user greghogan commented on the issue: https://github.com/apache/flink/pull/2060 The only internal usage of `StringParser` is from `GenericCsvInputFormat`. Should we make the encoding configurable in `GenericCsvInputFormat` with a default of US-ASCII? This could then be overridden in an additional constructor of `StringParser`. Should the same changes be made to `StringValueParser`? @rekhajoshm @fhueske @StephanEwen :+1: :-1:? > StringParser not specifying encoding to use > --- > > Key: FLINK-3921 > URL: https://issues.apache.org/jira/browse/FLINK-3921 > Project: Flink > Issue Type: Improvement > Components: Core >Affects Versions: 1.0.3 >Reporter: Tatu Saloranta >Assignee: Rekha Joshi >Priority: Trivial > > Class `flink.types.parser.StringParser` has javadocs indicating that contents > are expected to be Ascii, similar to `StringValueParser`. That makes sense, > but when constructing actual instance, no encoding is specified; on line 66 > f.ex: >this.result = new String(bytes, startPos+1, i - startPos - 2); > which leads to using whatever default platform encoding is. If contents > really are always Ascii (would not count on that as parser is used from CSV > reader), not a big deal, but it can lead to the usual Latin-1-VS-UTF-8 issues. > So I think that encoding should be explicitly specified, whatever is to be > used: javadocs claim ascii, so could be "us-ascii", but could well be UTF-8 > or even ISO-8859-1.
[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding
Github user greghogan commented on the issue: https://github.com/apache/flink/pull/2060 The only internal usage of `StringParser` is from `GenericCsvInputFormat`. Should we make the encoding configurable in `GenericCsvInputFormat` with a default of US-ASCII? This could then be overridden in an additional constructor of `StringParser`. Should the same changes be made to `StringValueParser`? @rekhajoshm @fhueske @StephanEwen :+1: :-1:?
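The Latin-1-vs-UTF-8 pitfall behind this issue is easy to demonstrate in isolation. The following is a minimal standalone sketch, not `StringParser` itself:

```java
import java.nio.charset.StandardCharsets;

public class CharsetPitfall {
    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9}; // UTF-8 encoding of 'é'

        // What a bare new String(bytes, off, len) does: uses whatever the
        // platform default charset happens to be -- results vary by machine.
        String platformDefault = new String(bytes);

        // What the issue proposes: name the charset explicitly.
        String utf8 = new String(bytes, StandardCharsets.UTF_8);       // one char
        String latin1 = new String(bytes, StandardCharsets.ISO_8859_1); // two chars

        System.out.println(utf8.length() + " vs " + latin1.length());
    }
}
```

Any of US-ASCII, UTF-8, or ISO-8859-1 would be fine as the chosen default, as the issue notes; the point is that picking one explicitly makes the parser's behavior independent of the JVM's platform encoding.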
[GitHub] flink pull request #2068: [hotfix] [core] Fix scope format keys
GitHub user zentol opened a pull request: https://github.com/apache/flink/pull/2068 [hotfix] [core] Fix scope format keys Fixes the scope format keys that were broken in 7ad8375a89374bec80571029e9166f1336bdea8e. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zentol/flink metrics-hotfix-1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2068.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2068 commit 57dda98c39a73720ceedfa2eb8564554cd7c245f Author: zentol, Date: 2016-06-03T14:07:21Z [hotfix] [core] Fix scope format keys
[jira] [Commented] (FLINK-4013) GraphAlgorithms to simplify directed and undirected graphs
[ https://issues.apache.org/jira/browse/FLINK-4013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314136#comment-15314136 ] ASF GitHub Bot commented on FLINK-4013: --- GitHub user greghogan opened a pull request: https://github.com/apache/flink/pull/2067 [FLINK-4013] [gelly] GraphAlgorithms to simplify directed and undirected graphs You can merge this pull request into a Git repository by running: $ git pull https://github.com/greghogan/flink 4013_graphalgorithms_to_simplify_directed_and_undirected_graphs Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2067.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2067 commit c27a21c2d9e4dcdcba2bb4689021a7e25e51d494 Author: Greg Hogan, Date: 2016-06-02T20:01:00Z [FLINK-4013] [gelly] GraphAlgorithms to simplify directed and undirected graphs > GraphAlgorithms to simplify directed and undirected graphs > -- > > Key: FLINK-4013 > URL: https://issues.apache.org/jira/browse/FLINK-4013 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan >Priority: Minor > Fix For: 1.1.0 > > > Create a directed {{GraphAlgorithm}} to remove self-loops and duplicate edges > and an undirected {{GraphAlgorithm}} to symmetrize and remove self-loops and > duplicate edges. > Remove {{RMatGraph.setSimpleGraph}} and the associated logic.
[GitHub] flink pull request #2067: [FLINK-4013] [gelly] GraphAlgorithms to simplify d...
GitHub user greghogan opened a pull request: https://github.com/apache/flink/pull/2067 [FLINK-4013] [gelly] GraphAlgorithms to simplify directed and undirected graphs You can merge this pull request into a Git repository by running: $ git pull https://github.com/greghogan/flink 4013_graphalgorithms_to_simplify_directed_and_undirected_graphs Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2067.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2067 commit c27a21c2d9e4dcdcba2bb4689021a7e25e51d494 Author: Greg Hogan, Date: 2016-06-02T20:01:00Z [FLINK-4013] [gelly] GraphAlgorithms to simplify directed and undirected graphs
[GitHub] flink issue #2054: [FLINK-3763] RabbitMQ Source/Sink standardize connection ...
Github user subhankarb commented on the issue: https://github.com/apache/flink/pull/2054 Hi @rmetzger, would you please review the changes?
[jira] [Commented] (FLINK-3763) RabbitMQ Source/Sink standardize connection parameters
[ https://issues.apache.org/jira/browse/FLINK-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314095#comment-15314095 ] ASF GitHub Bot commented on FLINK-3763: --- Github user subhankarb commented on the issue: https://github.com/apache/flink/pull/2054 Hi @rmetzger, would you please review the changes? > RabbitMQ Source/Sink standardize connection parameters > -- > > Key: FLINK-3763 > URL: https://issues.apache.org/jira/browse/FLINK-3763 > Project: Flink > Issue Type: Improvement > Components: Streaming Connectors >Affects Versions: 1.0.1 >Reporter: Robert Batts >Assignee: Subhankar Biswas > > The RabbitMQ source and sink should have the same capabilities in terms of > establishing a connection; currently the sink is lacking connection > parameters that are available on the source. Additionally, VirtualHost should > be an offered parameter for multi-tenant RabbitMQ clusters (if not specified > it goes to the vhost '/'). > Connection Parameters > === > - Host - Offered on both > - Port - Source only > - Virtual Host - Neither > - User - Source only > - Password - Source only > Additionally, it might be worth offering the URI as a valid constructor because > that would provide all 5 of the above parameters in a single String.
[jira] [Commented] (FLINK-3311) Add a connector for streaming data into Cassandra
[ https://issues.apache.org/jira/browse/FLINK-3311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314074#comment-15314074 ] ASF GitHub Bot commented on FLINK-3311: --- Github user rmetzger commented on the issue: https://github.com/apache/flink/pull/1771 I've reviewed the connector again. The issues I've seen previously (failure on restart) are resolved. However, I found new issues: - The Cassandra sink doesn't fail (at least not within 15 minutes) if Cassandra is not available anymore. It's probably just a configuration setting of the Cassandra driver to fail after a certain amount of time. - We should probably introduce a (configurable) limit (nr. of records / some GBs) for the write-ahead log. It seemed to me that, due to the other failed instance, no checkpoints were able to complete anymore (because some of the Cassandra sinks were stuck in notifyCheckpointComplete()), while others were accepting data into the WAL. This led to a lot of data being written into the state backend. I think the Cassandra sink should stop at some point in such a situation. Also, I would like to test the exactly-once behavior on a cluster more thoroughly. Currently, I've only tested whether the connector is properly failing and restoring, but I didn't test whether the written data is actually correct. However, since the code seems to be working under normal operation, I would suggest merging the connector now and then filing follow-up JIRAs for the remaining issues. This makes collaboration and reviews easier and allows our users to help test the Cassandra connector. Some log: ``` 2016-06-03 12:28:36,478 ERROR org.apache.flink.streaming.runtime.operators.GenericWriteAheadSink - Error while sending value. 
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_ONE (1 required but only 0 alive) at com.datastax.driver.core.exceptions.UnavailableException.copy(UnavailableException.java:128) at com.datastax.driver.core.Responses$Error.asException(Responses.java:114) at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:477) at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1005) at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:928) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847) at 
io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:618) at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:329) at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:250) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) Caused by: com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_ONE (1 required but only 0 alive) at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:50) at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:37) at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:266) at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:246) at
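The review above suggests capping the write-ahead log so a stuck sink back-pressures instead of growing the state backend without bound. A minimal sketch of that idea (purely illustrative — this is not Flink's `GenericWriteAheadSink`, and the class and method names here are hypothetical):

```python
class BoundedWriteAheadLog:
    """Toy write-ahead log with a byte budget. Once the budget is
    exhausted, append() refuses new records so the caller can block
    or fail instead of silently filling the state backend."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used = 0
        self.records = []

    def append(self, record: bytes) -> bool:
        if self.used + len(record) > self.max_bytes:
            return False  # limit reached: caller should stop accepting data
        self.records.append(record)
        self.used += len(record)
        return True

    def truncate(self) -> None:
        # Called once a checkpoint has been committed to the external
        # system; the logged records are no longer needed for replay.
        self.records.clear()
        self.used = 0
```

The same budget could equally be expressed as a record count; either way, the key design point is that the limit turns an unbounded state-size failure mode into an explicit, observable stall.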
[jira] [Updated] (FLINK-1730) Add a FlinkTools.persist style method to the Data Set.
[ https://issues.apache.org/jira/browse/FLINK-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Gevay updated FLINK-1730: --- Priority: Major (was: Minor) > Add a FlinkTools.persist style method to the Data Set. > -- > > Key: FLINK-1730 > URL: https://issues.apache.org/jira/browse/FLINK-1730 > Project: Flink > Issue Type: New Feature >Reporter: Stephan Ewen > > I think this is an operation that will be needed more prominently. Defining a > point where one long logical program is broken into different executions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-4015) FlinkKafkaProducer08 fails when partition leader changes
[ https://issues.apache.org/jira/browse/FLINK-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314042#comment-15314042 ] Sebastian Klemke commented on FLINK-4015: - We use the default retries: 0. Also, we can't set retries to a higher value, because the Kafka documentation says "Allowing retries will potentially change the ordering of records because if two records are sent to a single partition, and the first fails and is retried but the second succeeds, then the second record may appear first." For our application, this must not happen. Repeating whole batches in order would be okay. > FlinkKafkaProducer08 fails when partition leader changes > > > Key: FLINK-4015 > URL: https://issues.apache.org/jira/browse/FLINK-4015 > Project: Flink > Issue Type: Bug > Components: Kafka Connector >Affects Versions: 1.0.2 >Reporter: Sebastian Klemke > > When leader for a partition changes, producer fails with the following > exception: > {code} > 06:34:50,813 INFO org.apache.flink.yarn.YarnJobManager >- Status of job b323f5de3d32504651e861d5ecb27e7c (JOB_NAME) changed to > FAILING. 
> java.lang.RuntimeException: Could not forward element to next operator > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:354) > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:337) > at > org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51) > at OPERATOR.flatMap2(OPERATOR.java:82) > at OPERATOR.flatMap2(OPERATOR.java:16) > at > org.apache.flink.streaming.api.operators.co.CoStreamFlatMap.processElement2(CoStreamFlatMap.java:63) > at > org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.processInput(StreamTwoInputProcessor.java:207) > at > org.apache.flink.streaming.runtime.tasks.TwoInputStreamTask.run(TwoInputStreamTask.java:89) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:225) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.RuntimeException: Could not forward element to next > operator > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:354) > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:337) > at > org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:39) > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:351) > ... 
10 more > Caused by: java.lang.RuntimeException: Could not forward element to next > operator > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:354) > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:337) > at > org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:39) > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:351) > ... 13 more > Caused by: java.lang.Exception: Failed to send data to Kafka: This server is > not the leader for that topic-partition. > at > org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.checkErroneous(FlinkKafkaProducerBase.java:282) > at > org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.invoke(FlinkKafkaProducerBase.java:249) > at > org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:39) > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:351) > ... 16 more > Caused by: org.apache.kafka.common.errors.NotLeaderForPartitionException: > This server is not the leader for that topic-partition. > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
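The ordering hazard quoted from the Kafka documentation can be made concrete with a toy model: two records sent to one partition, where the first attempt fails (e.g. with `NotLeaderForPartitionException`) and is retried after the second record has already been delivered. This is a deliberately simplified simulation, not real producer code:

```python
def deliver_with_retries(records, fails_first_attempt):
    """Toy model of an async producer sending to one partition.
    Each record gets one in-order attempt; failed attempts are
    retried after the rest of the batch, which is exactly how
    enabling retries can reorder records within a partition."""
    log = []          # what actually lands in the partition
    retry_queue = []  # records whose first attempt was rejected
    for rec in records:
        if rec in fails_first_attempt:
            retry_queue.append(rec)  # e.g. broker was not the leader
        else:
            log.append(rec)
    # The retry succeeds, but only after later records already landed.
    log.extend(retry_queue)
    return log
```

With retries=0 the first failure surfaces as an exception instead (the behavior seen in the stack trace above), which is why the reporter asks for whole-batch, in-order repetition rather than per-record retries.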
[jira] [Commented] (FLINK-4002) [py] Improve testing infraestructure
[ https://issues.apache.org/jira/browse/FLINK-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314040#comment-15314040 ] ASF GitHub Bot commented on FLINK-4002: --- Github user zentol commented on the issue: https://github.com/apache/flink/pull/2063 you're right, that falls through. we should add additional checks: - Verify: index is equal to the size of expected values - Verify2: expected is empty > [py] Improve testing infraestructure > > > Key: FLINK-4002 > URL: https://issues.apache.org/jira/browse/FLINK-4002 > Project: Flink > Issue Type: Bug > Components: Python API >Affects Versions: 1.0.3 >Reporter: Omar Alvarez >Priority: Minor > Labels: Python, Testing > Original Estimate: 24h > Remaining Estimate: 24h > > The Verify() test function errors out when array elements are missing: > {code} > env.generate_sequence(1, 5)\ > .map(Id()).map_partition(Verify([1,2,3,4], "Sequence")).output() > {code} > {quote} > IndexError: list index out of range > {quote} > There should also be more documentation in test functions. > I am already working on a pull request to fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink issue #2063: [FLINK-4002] [py] Improve testing infraestructure
Github user zentol commented on the issue: https://github.com/apache/flink/pull/2063 you're right, that falls through. we should add additional checks: - Verify: index is equal to the size of expected values - Verify2: expected is empty
[GitHub] flink issue #2063: [FLINK-4002] [py] Improve testing infraestructure
Github user omaralvarez commented on the issue: https://github.com/apache/flink/pull/2063 Sorry, I said it wrong, it's the opposite. The case that fails in Verify() and Verify2(), is when we have more values in expected than in the resulting DataSet.
[jira] [Commented] (FLINK-4002) [py] Improve testing infraestructure
[ https://issues.apache.org/jira/browse/FLINK-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314030#comment-15314030 ] ASF GitHub Bot commented on FLINK-4002: --- Github user omaralvarez commented on the issue: https://github.com/apache/flink/pull/2063 Sorry, I said it wrong, it's the opposite. The case that fails in Verify() and Verify2(), is when we have more values in expected than in the resulting DataSet. > [py] Improve testing infraestructure > > > Key: FLINK-4002 > URL: https://issues.apache.org/jira/browse/FLINK-4002 > Project: Flink > Issue Type: Bug > Components: Python API >Affects Versions: 1.0.3 >Reporter: Omar Alvarez >Priority: Minor > Labels: Python, Testing > Original Estimate: 24h > Remaining Estimate: 24h > > The Verify() test function errors out when array elements are missing: > {code} > env.generate_sequence(1, 5)\ > .map(Id()).map_partition(Verify([1,2,3,4], "Sequence")).output() > {code} > {quote} > IndexError: list index out of range > {quote} > There should also be more documentation in test functions. > I am already working on a pull request to fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink issue #2063: [FLINK-4002] [py] Improve testing infraestructure
Github user zentol commented on the issue: https://github.com/apache/flink/pull/2063 when we have more data than expected, remove() will be called on an empty list and should throw an exception, no? if you want to execute the python tests you only have to call mvn verify on the flink-python package. 
```
cd flink-libraries/flink-python
mvn verify
```
[jira] [Commented] (FLINK-4002) [py] Improve testing infraestructure
[ https://issues.apache.org/jira/browse/FLINK-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314028#comment-15314028 ] ASF GitHub Bot commented on FLINK-4002: --- Github user zentol commented on the issue: https://github.com/apache/flink/pull/2063 when we have more data than expected, remove() will be called on an empty list and should throw an exception, no? if you want to execute the python tests you only have to call mvn verify on the flink-python package. ``` cd flink-libraries/flink-python mvn verify ``` > [py] Improve testing infraestructure > > > Key: FLINK-4002 > URL: https://issues.apache.org/jira/browse/FLINK-4002 > Project: Flink > Issue Type: Bug > Components: Python API >Affects Versions: 1.0.3 >Reporter: Omar Alvarez >Priority: Minor > Labels: Python, Testing > Original Estimate: 24h > Remaining Estimate: 24h > > The Verify() test function errors out when array elements are missing: > {code} > env.generate_sequence(1, 5)\ > .map(Id()).map_partition(Verify([1,2,3,4], "Sequence")).output() > {code} > {quote} > IndexError: list index out of range > {quote} > There should also be more documentation in test functions. > I am already working on a pull request to fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-4015) FlinkKafkaProducer08 fails when partition leader changes
[ https://issues.apache.org/jira/browse/FLINK-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314026#comment-15314026 ] Robert Metzger commented on FLINK-4015: --- Hi Sebastian, have you set the number of retries to a value higher than 0? By default, it is set to 0, so the producer will not retry in case such an error happens. I would recommend allowing more retries. The reason why we don't set the number to something higher by default is that retries can cause duplicate records in Kafka. > FlinkKafkaProducer08 fails when partition leader changes > > > Key: FLINK-4015 > URL: https://issues.apache.org/jira/browse/FLINK-4015 > Project: Flink > Issue Type: Bug > Components: Kafka Connector >Affects Versions: 1.0.2 >Reporter: Sebastian Klemke > > When leader for a partition changes, producer fails with the following > exception: > {code} > 06:34:50,813 INFO org.apache.flink.yarn.YarnJobManager >- Status of job b323f5de3d32504651e861d5ecb27e7c (JOB_NAME) changed to > FAILING. 
> java.lang.RuntimeException: Could not forward element to next operator > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:354) > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:337) > at > org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51) > at OPERATOR.flatMap2(OPERATOR.java:82) > at OPERATOR.flatMap2(OPERATOR.java:16) > at > org.apache.flink.streaming.api.operators.co.CoStreamFlatMap.processElement2(CoStreamFlatMap.java:63) > at > org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.processInput(StreamTwoInputProcessor.java:207) > at > org.apache.flink.streaming.runtime.tasks.TwoInputStreamTask.run(TwoInputStreamTask.java:89) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:225) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.RuntimeException: Could not forward element to next > operator > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:354) > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:337) > at > org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:39) > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:351) > ... 
10 more > Caused by: java.lang.RuntimeException: Could not forward element to next > operator > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:354) > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:337) > at > org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:39) > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:351) > ... 13 more > Caused by: java.lang.Exception: Failed to send data to Kafka: This server is > not the leader for that topic-partition. > at > org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.checkErroneous(FlinkKafkaProducerBase.java:282) > at > org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.invoke(FlinkKafkaProducerBase.java:249) > at > org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:39) > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:351) > ... 16 more > Caused by: org.apache.kafka.common.errors.NotLeaderForPartitionException: > This server is not the leader for that topic-partition. > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink issue #2063: [FLINK-4002] [py] Improve testing infraestructure
Github user omaralvarez commented on the issue: https://github.com/apache/flink/pull/2063 I have corrected Verify2(), but there is another case that is not checked: when the resulting datasets have more elements than expected, right now the error will go unnoticed. I also wanted to ask: is there a way to execute only the Python tests? I want to unify the utilities in a file, but without knowing the execution path, I cannot make sure the module will be imported correctly.
[jira] [Commented] (FLINK-4002) [py] Improve testing infraestructure
[ https://issues.apache.org/jira/browse/FLINK-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314023#comment-15314023 ] ASF GitHub Bot commented on FLINK-4002: --- Github user omaralvarez commented on the issue: https://github.com/apache/flink/pull/2063 I have corrected Verify2(), but there is another case that is not checked, when the resulting datasets have more elements than expected, right now the error will go unnoticed. I also wanted to ask, is there a way to execute only the python tests, since I want to unify the utilities in a file, but without knowing what is the execution path, I cannot make sure if the module will be imported correctly. > [py] Improve testing infraestructure > > > Key: FLINK-4002 > URL: https://issues.apache.org/jira/browse/FLINK-4002 > Project: Flink > Issue Type: Bug > Components: Python API >Affects Versions: 1.0.3 >Reporter: Omar Alvarez >Priority: Minor > Labels: Python, Testing > Original Estimate: 24h > Remaining Estimate: 24h > > The Verify() test function errors out when array elements are missing: > {code} > env.generate_sequence(1, 5)\ > .map(Id()).map_partition(Verify([1,2,3,4], "Sequence")).output() > {code} > {quote} > IndexError: list index out of range > {quote} > There should also be more documentation in test functions. > I am already working on a pull request to fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
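The two failure modes discussed in this thread — an `IndexError` when `expected` is longer than the result, and a silent pass when the result is longer than `expected` — could both be closed by tracking a count and checking for leftovers. A hedged sketch, not the actual flink-python test utility (the class and method names mirror the thread's examples but are hypothetical here):

```python
class Verify:
    """Stricter verifier: raises when the result contains records the
    expectation lacks, when a value mismatches, and when expected
    values are left over at the end (the silent case noted above)."""

    def __init__(self, expected, name):
        self.expected = list(expected)
        self.name = name

    def map_partition(self, iterator, collector=None):
        count = 0
        for value in iterator:
            if count >= len(self.expected):
                # result has more elements than expected: previously unnoticed
                raise AssertionError(
                    "%s: unexpected extra value %r" % (self.name, value))
            if value != self.expected[count]:
                raise AssertionError(
                    "%s: got %r, expected %r"
                    % (self.name, value, self.expected[count]))
            count += 1
        if count < len(self.expected):
            # expected has more elements than the result: previously an
            # IndexError (or silent), now an explicit failure
            raise AssertionError(
                "%s: missing expected values %r"
                % (self.name, self.expected[count:]))
```

Using a running count instead of indexing into a list avoids the `IndexError` from the issue description, and the final length check is the "index is equal to the size of expected values" assertion proposed in the thread.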
[jira] [Created] (FLINK-4015) FlinkKafkaProducer08 fails when partition leader changes
Sebastian Klemke created FLINK-4015: --- Summary: FlinkKafkaProducer08 fails when partition leader changes Key: FLINK-4015 URL: https://issues.apache.org/jira/browse/FLINK-4015 Project: Flink Issue Type: Bug Components: Kafka Connector Affects Versions: 1.0.2 Reporter: Sebastian Klemke When leader for a partition changes, producer fails with the following exception: {code} 06:34:50,813 INFO org.apache.flink.yarn.YarnJobManager - Status of job b323f5de3d32504651e861d5ecb27e7c (JOB_NAME) changed to FAILING. java.lang.RuntimeException: Could not forward element to next operator at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:354) at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:337) at org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51) at OPERATOR.flatMap2(OPERATOR.java:82) at OPERATOR.flatMap2(OPERATOR.java:16) at org.apache.flink.streaming.api.operators.co.CoStreamFlatMap.processElement2(CoStreamFlatMap.java:63) at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.processInput(StreamTwoInputProcessor.java:207) at org.apache.flink.streaming.runtime.tasks.TwoInputStreamTask.run(TwoInputStreamTask.java:89) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:225) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.RuntimeException: Could not forward element to next operator at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:354) at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:337) at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:39) at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:351) ... 
10 more Caused by: java.lang.RuntimeException: Could not forward element to next operator at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:354) at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:337) at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:39) at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:351) ... 13 more Caused by: java.lang.Exception: Failed to send data to Kafka: This server is not the leader for that topic-partition. at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.checkErroneous(FlinkKafkaProducerBase.java:282) at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.invoke(FlinkKafkaProducerBase.java:249) at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:39) at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:351) ... 16 more Caused by: org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3650) Add maxBy/minBy to Scala DataSet API
[ https://issues.apache.org/jira/browse/FLINK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314000#comment-15314000 ] ASF GitHub Bot commented on FLINK-3650: --- Github user zentol commented on the issue: https://github.com/apache/flink/pull/1856 I did not look at it in detail, just checked whether the builds passed. > Add maxBy/minBy to Scala DataSet API > > > Key: FLINK-3650 > URL: https://issues.apache.org/jira/browse/FLINK-3650 > Project: Flink > Issue Type: Improvement > Components: Java API, Scala API >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: ramkrishna.s.vasudevan > > The stable Java DataSet API contains the API calls {{maxBy}} and {{minBy}}. > These methods are not supported by the Scala DataSet API. These methods > should be added in order to have a consistent API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink issue #1856: FLINK-3650 Add maxBy/minBy to Scala DataSet API
Github user zentol commented on the issue: https://github.com/apache/flink/pull/1856 I did not look at it in detail, just checked whether the builds passed.
[jira] [Commented] (FLINK-3650) Add maxBy/minBy to Scala DataSet API
[ https://issues.apache.org/jira/browse/FLINK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313995#comment-15313995 ] ASF GitHub Bot commented on FLINK-3650: --- Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/1856 Thanks for the review @zentol . Ok. I will correct the line length. Does the overall approach look good? > Add maxBy/minBy to Scala DataSet API > > > Key: FLINK-3650 > URL: https://issues.apache.org/jira/browse/FLINK-3650 > Project: Flink > Issue Type: Improvement > Components: Java API, Scala API >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: ramkrishna.s.vasudevan > > The stable Java DataSet API contains the API calls {{maxBy}} and {{minBy}}. > These methods are not supported by the Scala DataSet API. These methods > should be added in order to have a consistent API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink issue #1856: FLINK-3650 Add maxBy/minBy to Scala DataSet API
Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/1856 Thanks for the review @zentol. Ok. I will correct the line length. Does the overall approach look good?
[jira] [Commented] (FLINK-3650) Add maxBy/minBy to Scala DataSet API
[ https://issues.apache.org/jira/browse/FLINK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313990#comment-15313990 ] ASF GitHub Bot commented on FLINK-3650: --- Github user zentol commented on the issue: https://github.com/apache/flink/pull/1856 Build is failing due to 54 scala style violations. (line length) > Add maxBy/minBy to Scala DataSet API > > > Key: FLINK-3650 > URL: https://issues.apache.org/jira/browse/FLINK-3650 > Project: Flink > Issue Type: Improvement > Components: Java API, Scala API >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: ramkrishna.s.vasudevan > > The stable Java DataSet API contains the API calls {{maxBy}} and {{minBy}}. > These methods are not supported by the Scala DataSet API. These methods > should be added in order to have a consistent API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink issue #1856: FLINK-3650 Add maxBy/minBy to Scala DataSet API
Github user zentol commented on the issue: https://github.com/apache/flink/pull/1856 Build is failing due to 54 scala style violations. (line length)
[jira] [Commented] (FLINK-3650) Add maxBy/minBy to Scala DataSet API
[ https://issues.apache.org/jira/browse/FLINK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313977#comment-15313977 ] ramkrishna.s.vasudevan commented on FLINK-3650: --- [~till.rohrmann] [~fhueske] Any chance of a review here? > Add maxBy/minBy to Scala DataSet API > > > Key: FLINK-3650 > URL: https://issues.apache.org/jira/browse/FLINK-3650 > Project: Flink > Issue Type: Improvement > Components: Java API, Scala API >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: ramkrishna.s.vasudevan > > The stable Java DataSet API contains the API calls {{maxBy}} and {{minBy}}. > These methods are not supported by the Scala DataSet API. These methods > should be added in order to have a consistent API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3919) Distributed Linear Algebra: row-based matrix
[ https://issues.apache.org/jira/browse/FLINK-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313955#comment-15313955 ] Simone Robutti commented on FLINK-3919: --- Umh, probably you're right. I checked breeze and they use addition for matrix addition and sum for element-wise sum. > Distributed Linear Algebra: row-based matrix > > > Key: FLINK-3919 > URL: https://issues.apache.org/jira/browse/FLINK-3919 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: Simone Robutti >Assignee: Simone Robutti > > Distributed matrix implementation as a DataSet of IndexedRow and related > operations -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3919) Distributed Linear Algebra: row-based matrix
[ https://issues.apache.org/jira/browse/FLINK-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313952#comment-15313952 ] ASF GitHub Bot commented on FLINK-3919: --- Github user chiwanpark commented on a diff in the pull request: https://github.com/apache/flink/pull/1996#discussion_r65688309 --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/math/distributed/DistributedRowMatrix.scala --- @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.math.distributed + +import org.apache.flink.api.scala._ +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.distributed.DistributedMatrix._ +import org.apache.flink.ml.math._ + +/** + * Distributed row-major matrix representation. + * @param numRows Number of rows. + * @param numCols Number of columns. + */ +class DistributedRowMatrix(val data: DataSet[IndexedRow], + val numRows: Int, + val numCols: Int) +extends DistributedMatrix { + + /** +* Collects the data in the form of a sequence of coordinates associated with their values. 
+* @return +*/ + def toCOO: Seq[(MatrixRowIndex, MatrixColIndex, Double)] = { + +val localRows = data.collect() + +for (IndexedRow(rowIndex, vector) <- localRows; + (columnIndex, value) <- vector) yield (rowIndex, columnIndex, value) + } + + /** +* Collects the data in the form of a SparseMatrix +* @return +*/ + def toLocalSparseMatrix: SparseMatrix = { +val localMatrix = + SparseMatrix.fromCOO(this.numRows, this.numCols, this.toCOO) +require(localMatrix.numRows == this.numRows) +require(localMatrix.numCols == this.numCols) +localMatrix + } + + //TODO: convert to dense representation on the distributed matrix and collect it afterward + def toLocalDenseMatrix: DenseMatrix = this.toLocalSparseMatrix.toDenseMatrix + + /** +* Apply a high-order function to couple of rows +* @param fun +* @param other +* @return +*/ + def byRowOperation(fun: (Vector, Vector) => Vector, + other: DistributedRowMatrix): DistributedRowMatrix = { +val otherData = other.data +require(this.numCols == other.numCols) +require(this.numRows == other.numRows) + +val result = this.data + .fullOuterJoin(otherData) + .where("rowIndex") + .equalTo("rowIndex")( + (left: IndexedRow, right: IndexedRow) => { +val row1 = Option(left) match { + case Some(row: IndexedRow) => row + case None => +IndexedRow( +right.rowIndex, +SparseVector.fromCOO(right.values.size, List((0, 0.0 +} +val row2 = Option(right) match { + case Some(row: IndexedRow) => row + case None => +IndexedRow( +left.rowIndex, +SparseVector.fromCOO(left.values.size, List((0, 0.0 +} +IndexedRow(row1.rowIndex, fun(row1.values, row2.values)) + } + ) +new DistributedRowMatrix(result, numRows, numCols) + } + + /** +* Add the matrix to another matrix. +* @param other +* @return +*/ + def sum(other: DistributedRowMatrix): DistributedRowMatrix = { --- End diff -- Is `add` more proper name for this method? Please do not update this PR if you agree with me but just notify me because I'm rebasing this PR on current master. 
> Distributed Linear Algebra: row-based matrix > > > Key: FLINK-3919 > URL:
[jira] [Updated] (FLINK-4014) Jobs hang. Maybe NetworkBufferPool and LocalBufferPool has some problem
[ https://issues.apache.org/jira/browse/FLINK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhengBowen updated FLINK-4014: -- Description: I ran five jobs and all the slots were filled, and all five jobs appear to be hung. The following attachments are my jstack output and the job source code. The hang.stack shows that every thread related to the jobs is in WAIT state. Why? was: I ran five jobs and all the slots were filled, and all five jobs appear to be hung. The following attachments are my jstack output and the job source code. The hang.stack shows that every thread related to the jobs is in WAIT state. > Jobs hang. Maybe NetworkBufferPool and LocalBufferPool has some problem > --- > > Key: FLINK-4014 > URL: https://issues.apache.org/jira/browse/FLINK-4014 > Project: Flink > Issue Type: Bug > Reporter: ZhengBowen > Attachments: FlinkJob_20160603_174748_03.java, hang.stack > > > I ran five jobs and all the slots were filled, and all five jobs appear to be hung. > The following attachments are my jstack output and the job source code. > The hang.stack shows that every thread related to the jobs is in WAIT state. Why? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-4014) Jobs hang. Maybe NetworkBufferPool and LocalBufferPool has some problem
[ https://issues.apache.org/jira/browse/FLINK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhengBowen updated FLINK-4014: -- Summary: Jobs hang. Maybe NetworkBufferPool and LocalBufferPool has some problem (was: Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem) > Jobs hang. Maybe NetworkBufferPool and LocalBufferPool has some problem > --- > > Key: FLINK-4014 > URL: https://issues.apache.org/jira/browse/FLINK-4014 > Project: Flink > Issue Type: Bug > Reporter: ZhengBowen > Attachments: FlinkJob_20160603_174748_03.java, hang.stack > > > I ran five jobs and all the slots were filled, and all five jobs appear to be hung. > The following attachments are my jstack output and the job source code. > The hang.stack shows that every thread related to the jobs is in WAIT state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-4014) Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem
[ https://issues.apache.org/jira/browse/FLINK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhengBowen updated FLINK-4014: -- Description: I ran five jobs and all the slots were filled, and all five jobs appear to be hung. The following attachments are my jstack output and the job source code. The hang.stack shows that every thread related to the jobs is in WAIT state. was: I ran five jobs and all the slots were filled, and all five jobs appear to be hung. The following attachments are my jstack output and the job source code. > Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem > --- > > Key: FLINK-4014 > URL: https://issues.apache.org/jira/browse/FLINK-4014 > Project: Flink > Issue Type: Bug > Reporter: ZhengBowen > Attachments: FlinkJob_20160603_174748_03.java, hang.stack > > > I ran five jobs and all the slots were filled, and all five jobs appear to be hung. > The following attachments are my jstack output and the job source code. > The hang.stack shows that every thread related to the jobs is in WAIT state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request #2066: Updated ssh configuration in base Dockerfile
GitHub user techmaniack opened a pull request: https://github.com/apache/flink/pull/2066 Updated ssh configuration in base Dockerfile - The pull request addresses only one issue: SSH into the container no longer works, because 'without-password' has been replaced with 'prohibit-password' in Xenial. You can merge this pull request into a Git repository by running: $ git pull https://github.com/techmaniack/flink patch-1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2066.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2066 commit 768723ccf9cf2cab09905063cbf5f7ddf4296494 Author: AbdulKarim Memon Date: 2016-06-03T10:20:51Z Updated ssh configuration in base Dockerfile SSH into the container no longer works, because 'without-password' has been replaced with 'prohibit-password' in Xenial. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-4014) Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem
[ https://issues.apache.org/jira/browse/FLINK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313944#comment-15313944 ] ZhengBowen commented on FLINK-4014: --- [~StephanEwen] can you help me? > Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem > --- > > Key: FLINK-4014 > URL: https://issues.apache.org/jira/browse/FLINK-4014 > Project: Flink > Issue Type: Bug > Reporter: ZhengBowen > Attachments: FlinkJob_20160603_174748_03.java, hang.stack > > > I ran five jobs and all the slots were filled, and all five jobs appear to be hung. > The following attachments are my jstack output and the job source code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-4014) Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem
[ https://issues.apache.org/jira/browse/FLINK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhengBowen updated FLINK-4014: -- Attachment: FlinkJob_20160603_174748_03.java > Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem > --- > > Key: FLINK-4014 > URL: https://issues.apache.org/jira/browse/FLINK-4014 > Project: Flink > Issue Type: Bug > Reporter: ZhengBowen > Attachments: FlinkJob_20160603_174748_03.java, hang.stack > > > I ran five jobs and all the slots were filled, and all five jobs appear to be hung. > The following attachments are my jstack output and the job source code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-4014) Jobs hang, Maybe NetworkBufferPool and Local、
ZhengBowen created FLINK-4014: - Summary: Jobs hang, Maybe NetworkBufferPool and Local、 Key: FLINK-4014 URL: https://issues.apache.org/jira/browse/FLINK-4014 Project: Flink Issue Type: Bug Reporter: ZhengBowen Attachments: hang.stack I ran five jobs and all the slots were filled, and all five jobs appear to be hung. The following attachments are my jstack output and the job source code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-4014) Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem
[ https://issues.apache.org/jira/browse/FLINK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhengBowen updated FLINK-4014: -- Attachment: hang.stack > Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem > --- > > Key: FLINK-4014 > URL: https://issues.apache.org/jira/browse/FLINK-4014 > Project: Flink > Issue Type: Bug > Reporter: ZhengBowen > Attachments: hang.stack > > > I ran five jobs and all the slots were filled, and all five jobs appear to be hung. > The following attachments are my jstack output and the job source code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-4014) Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem
[ https://issues.apache.org/jira/browse/FLINK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhengBowen updated FLINK-4014: -- Summary: Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem (was: Jobs hang, Maybe NetworkBufferPool and LocalBuffer) > Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem > --- > > Key: FLINK-4014 > URL: https://issues.apache.org/jira/browse/FLINK-4014 > Project: Flink > Issue Type: Bug > Reporter: ZhengBowen > Attachments: hang.stack > > > I ran five jobs and all the slots were filled, and all five jobs appear to be hung. > The following attachments are my jstack output and the job source code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-4014) Jobs hang, Maybe NetworkBufferPool and LocalBuffer
[ https://issues.apache.org/jira/browse/FLINK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhengBowen updated FLINK-4014: -- Summary: Jobs hang, Maybe NetworkBufferPool and LocalBuffer (was: Jobs hang, Maybe NetworkBufferPool and Local、) > Jobs hang, Maybe NetworkBufferPool and LocalBuffer > -- > > Key: FLINK-4014 > URL: https://issues.apache.org/jira/browse/FLINK-4014 > Project: Flink > Issue Type: Bug > Reporter: ZhengBowen > Attachments: hang.stack > > > I ran five jobs and all the slots were filled, and all five jobs appear to be hung. > The following attachments are my jstack output and the job source code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1873) Distributed matrix implementation
[ https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313938#comment-15313938 ] Simone Robutti commented on FLINK-1873: --- That would be perfect :) > Distributed matrix implementation > - > > Key: FLINK-1873 > URL: https://issues.apache.org/jira/browse/FLINK-1873 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library > Reporter: liaoyuxi > Assignee: Simone Robutti > Labels: ML > > It would help to implement machine learning algorithms more quickly and > concisely if Flink provided support for storing data and computations in a > distributed matrix. The design of the implementation is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1873) Distributed matrix implementation
[ https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313915#comment-15313915 ] Chiwan Park commented on FLINK-1873: I think we don't need to hurry, but I'll review the first PR and merge it in 5 hours if there are no further problems. > Distributed matrix implementation > - > > Key: FLINK-1873 > URL: https://issues.apache.org/jira/browse/FLINK-1873 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library > Reporter: liaoyuxi > Assignee: Simone Robutti > Labels: ML > > It would help to implement machine learning algorithms more quickly and > concisely if Flink provided support for storing data and computations in a > distributed matrix. The design of the implementation is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1873) Distributed matrix implementation
[ https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313903#comment-15313903 ] Simone Robutti commented on FLINK-1873: --- Ok. Anyway, tomorrow will be my first day of holiday, and from Wednesday I won't have continuous access to the internet for 2 weeks. I hope to get the first PR merged before that day so that I can submit the second PR. Otherwise, for trivial corrections to the first PR, I will hand over to a colleague of mine for the 2 weeks I'm away. > Distributed matrix implementation > - > > Key: FLINK-1873 > URL: https://issues.apache.org/jira/browse/FLINK-1873 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library > Reporter: liaoyuxi > Assignee: Simone Robutti > Labels: ML > > It would help to implement machine learning algorithms more quickly and > concisely if Flink provided support for storing data and computations in a > distributed matrix. The design of the implementation is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3937) Make flink cli list, savepoint, cancel and stop work on Flink-on-YARN clusters
[ https://issues.apache.org/jira/browse/FLINK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313890#comment-15313890 ] ASF GitHub Bot commented on FLINK-3937: --- Github user mxm commented on the issue: https://github.com/apache/flink/pull/2034 Thanks for the update! > Make flink cli list, savepoint, cancel and stop work on Flink-on-YARN clusters > -- > > Key: FLINK-3937 > URL: https://issues.apache.org/jira/browse/FLINK-3937 > Project: Flink > Issue Type: Improvement >Reporter: Sebastian Klemke >Assignee: Maximilian Michels >Priority: Trivial > Attachments: improve_flink_cli_yarn_integration.patch > > > Currently, flink cli can't figure out JobManager RPC location for > Flink-on-YARN clusters. Therefore, list, savepoint, cancel and stop > subcommands are hard to invoke if you only know the YARN application ID. As > an improvement, I suggest adding a -yid option to the > mentioned subcommands that can be used together with -m yarn-cluster. Flink > cli would then retrieve JobManager RPC location from YARN ResourceManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3908) FieldParsers error state is not reset correctly to NONE
[ https://issues.apache.org/jira/browse/FLINK-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313872#comment-15313872 ] ASF GitHub Bot commented on FLINK-3908: --- Github user zentol commented on the issue: https://github.com/apache/flink/pull/2007 I'm not sure; in any case it should not be removed as part of this PR. You can open a separate JIRA or ask on the mailing list. > FieldParsers error state is not reset correctly to NONE > --- > > Key: FLINK-3908 > URL: https://issues.apache.org/jira/browse/FLINK-3908 > Project: Flink > Issue Type: Bug > Components: Core > Affects Versions: 1.0.2 > Reporter: Flavio Pompermaier > Assignee: Flavio Pompermaier > Labels: parser > > If a parse error occurs while parsing a CSV file (for example, when an > integer column contains non-integer values), the errorState is not reset > correctly before the next parseField call. A simple fix would be to add a call to > {{setErrorState(ParseErrorState.NONE)}} as the first statement of the {{parseField()}} > function, but this is something that should be handled better (by default) for every > subclass of {{FieldParser}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3908) FieldParsers error state is not reset correctly to NONE
[ https://issues.apache.org/jira/browse/FLINK-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313864#comment-15313864 ] ASF GitHub Bot commented on FLINK-3908: --- Github user fpompermaier commented on the issue: https://github.com/apache/flink/pull/2007 Now it should be ok, according to your suggestions. I misunderstood what @StephanEwen was trying to say; thanks @zentol for the clarification! Just one other thing: the method GenericCsvInputFormat.checkAndCoSort() is never used in the code. Do you want to keep it? > FieldParsers error state is not reset correctly to NONE > --- > > Key: FLINK-3908 > URL: https://issues.apache.org/jira/browse/FLINK-3908 > Project: Flink > Issue Type: Bug > Components: Core > Affects Versions: 1.0.2 > Reporter: Flavio Pompermaier > Assignee: Flavio Pompermaier > Labels: parser > > If a parse error occurs while parsing a CSV file (for example, when an > integer column contains non-integer values), the errorState is not reset > correctly before the next parseField call. A simple fix would be to add a call to > {{setErrorState(ParseErrorState.NONE)}} as the first statement of the {{parseField()}} > function, but this is something that should be handled better (by default) for every > subclass of {{FieldParser}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
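The fix discussed in FLINK-3908 can be sketched as follows. This is a simplified, hypothetical stand-in for Flink's FieldParser (the class, enum constants, and parsing logic here are illustrative, not Flink's actual API); it shows why resetting the error state at the top of parseField() keeps a failure on one record from leaking into the next parse.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the proposed fix: reset the error state at the
// start of every parse call. Names mirror the discussion (setErrorState,
// ParseErrorState.NONE) but this is not Flink's real FieldParser.
public class IntFieldParserSketch {
    public enum ParseErrorState { NONE, NUMERIC_VALUE_ILLEGAL_CHARACTER }

    private ParseErrorState errorState = ParseErrorState.NONE;

    private void setErrorState(ParseErrorState s) { errorState = s; }

    public ParseErrorState getErrorState() { return errorState; }

    /** Parses bytes [start, limit) as a decimal int; returns -1 on error. */
    public int parseField(byte[] bytes, int start, int limit) {
        // The proposed fix: clear any error left over from the previous call.
        setErrorState(ParseErrorState.NONE);
        int value = 0;
        for (int i = start; i < limit; i++) {
            byte b = bytes[i];
            if (b < '0' || b > '9') {
                setErrorState(ParseErrorState.NUMERIC_VALUE_ILLEGAL_CHARACTER);
                return -1;
            }
            value = value * 10 + (b - '0');
        }
        return value;
    }

    public static void main(String[] args) {
        IntFieldParserSketch p = new IntFieldParserSketch();
        byte[] bad = "12x".getBytes(StandardCharsets.US_ASCII);
        byte[] good = "345".getBytes(StandardCharsets.US_ASCII);
        p.parseField(bad, 0, bad.length);           // sets the error state
        int v = p.parseField(good, 0, good.length); // must not see stale state
        System.out.println(v + " " + p.getErrorState()); // prints "345 NONE"
    }
}
```

Without the reset on the first line of parseField(), the second call would still report the error from the first, which is exactly the behavior the issue describes.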
[GitHub] flink pull request #1996: [FLINK-3919][flink-ml] Distributed Linear Algebra:...
Github user chobeat commented on a diff in the pull request: https://github.com/apache/flink/pull/1996#discussion_r65671113 --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/math/distributed/DistributedRowMatrix.scala --- @@ -0,0 +1,166 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.math.distributed + +import breeze.linalg.{CSCMatrix => BreezeSparseMatrix, Matrix => BreezeMatrix, Vector => BreezeVector} +import org.apache.flink.api.scala._ +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{Matrix => FlinkMatrix, _} + +/** + * Distributed row-major matrix representation. + * @param numRows Number of rows. + * @param numCols Number of columns. + */ +class DistributedRowMatrix(val data: DataSet[IndexedRow], + val numRows: Int, + val numCols: Int ) +extends DistributedMatrix { + + + + /** +* Collects the data in the form of a sequence of coordinates associated with their values. 
+* @return +*/ + def toCOO: Seq[(Int, Int, Double)] = { + +val localRows = data.collect() + +for (IndexedRow(rowIndex, vector) <- localRows; + (columnIndex, value) <- vector) yield (rowIndex, columnIndex, value) + } + + /** +* Collects the data in the form of a SparseMatrix +* @return +*/ + def toLocalSparseMatrix: SparseMatrix = { +val localMatrix = + SparseMatrix.fromCOO(this.numRows, this.numCols, this.toCOO) +require(localMatrix.numRows == this.numRows) +require(localMatrix.numCols == this.numCols) +localMatrix + } + + //TODO: convert to dense representation on the distributed matrix and collect it afterward + def toLocalDenseMatrix: DenseMatrix = this.toLocalSparseMatrix.toDenseMatrix + + /** +* Apply a high-order function to couple of rows +* @param fun +* @param other +* @return +*/ + def byRowOperation(fun: (Vector, Vector) => Vector, + other: DistributedRowMatrix): DistributedRowMatrix = { +val otherData = other.data +require(this.numCols == other.numCols) +require(this.numRows == other.numRows) + +val result = this.data + .fullOuterJoin(otherData) + .where("rowIndex") + .equalTo("rowIndex")( + (left: IndexedRow, right: IndexedRow) => { +val row1 = Option(left) match { + case Some(row: IndexedRow) => row + case None => +IndexedRow( +right.rowIndex, +SparseVector.fromCOO(right.values.size, List((0, 0.0 +} +val row2 = Option(right) match { + case Some(row: IndexedRow) => row + case None => +IndexedRow( +left.rowIndex, +SparseVector.fromCOO(left.values.size, List((0, 0.0 +} +IndexedRow(row1.rowIndex, fun(row1.values, row2.values)) + } + ) +new DistributedRowMatrix(result, numRows, numCols) + } + + /** +* Add the matrix to another matrix. +* @param other +* @return +*/ + def sum(other: DistributedRowMatrix): DistributedRowMatrix = { +val sumFunction: (Vector, Vector) => Vector = (x: Vector, y: Vector) => + (x.asBreeze + y.asBreeze).fromBreeze +this.byRowOperation(sumFunction, other) + } + + /** +* Subtracts another matrix. 
+* @param other +* @return +*/ + def subtract(other: DistributedRowMatrix): DistributedRowMatrix = { +val subFunction: (Vector, Vector) => Vector = (x: Vector, y: Vector) => + (x.asBreeze - y.asBreeze).fromBreeze +this.byRowOperation(subFunction, other) +
[jira] [Commented] (FLINK-3477) Add hash-based combine strategy for ReduceFunction
[ https://issues.apache.org/jira/browse/FLINK-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313791#comment-15313791 ] ASF GitHub Bot commented on FLINK-3477: --- Github user ggevay commented on a diff in the pull request: https://github.com/apache/flink/pull/1517#discussion_r65667687 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/operators/ReduceCombineDriver.java --- @@ -42,34 +44,38 @@ * Combine operator for Reduce functions, standalone (not chained). * Sorts and groups and reduces data, but never spills the sort. May produce multiple * partially aggregated groups. - * + * * @param The data type consumed and produced by the combiner. */ public class ReduceCombineDriver implements Driver{ - + private static final Logger LOG = LoggerFactory.getLogger(ReduceCombineDriver.class); /** Fix length records with a length below this threshold will be in-place sorted, if possible. */ private static final int THRESHOLD_FOR_IN_PLACE_SORTING = 32; - - + + private TaskContext taskContext; private TypeSerializer serializer; private TypeComparator comparator; - + private ReduceFunction reducer; - + private Collector output; - + + private DriverStrategy strategy; + private InMemorySorter sorter; - + private QuickSort sortAlgo = new QuickSort(); + private ReduceHashTable table; + private List memory; - private boolean running; + private volatile boolean canceled; --- End diff -- Sorry, I forgot the rename in the chained driver: 6abd3f3cf49568cc0fecd85d7e7d8a0d7f9ec39f And I forgot to invert the meaning with the rename in ReduceCombineDriver: 984ba12f44a7ee9b16790c3e172b53969448e1c2 > Add hash-based combine strategy for ReduceFunction > -- > > Key: FLINK-3477 > URL: https://issues.apache.org/jira/browse/FLINK-3477 > Project: Flink > Issue Type: Sub-task > Components: Local Runtime >Reporter: Fabian Hueske >Assignee: Gabor Gevay > > This issue is about adding a hash-based combine strategy for ReduceFunctions. 
> The interface of the {{reduce()}} method is as follows: > {code} > public T reduce(T v1, T v2) > {code} > Input type and output type are identical, and the function returns only a > single value. A Reduce function is applied incrementally to compute a final > aggregated value. This makes it possible to hold the pre-aggregated value in a hash > table and update it with each function call. > The hash-based strategy requires a special implementation of an in-memory hash > table. The hash table should support in-place updates of elements (if the > new value has the same binary length as the old one) but also appending updates > with invalidation of the old value (if the binary length of the new value > differs). The hash table needs to be able to evict and emit all elements if > it runs out of memory. > We should also add {{HASH}} and {{SORT}} compiler hints to > {{DataSet.reduce()}} and {{Grouping.reduce()}} to allow users to pick the > execution strategy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
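The hash-based combine idea in the issue can be illustrated with a minimal sketch. A plain java.util.HashMap stands in for Flink's specialized in-memory hash table (this sketch is hypothetical and omits the in-place vs. appending update distinction and eviction on memory pressure); the point is that reduce(v1, v2) folds each incoming record into one pre-aggregated value per key.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

// Minimal sketch of a hash-based combiner: one pre-aggregated value per
// key, updated with each call via the reduce function. Not Flink's
// ReduceHashTable, which also manages memory segments and spilling.
public class HashCombinerSketch<K, T> {
    private final Map<K, T> table = new HashMap<>();
    private final BinaryOperator<T> reducer; // plays the role of reduce(v1, v2)

    public HashCombinerSketch(BinaryOperator<T> reducer) {
        this.reducer = reducer;
    }

    /** Folds one record into the pre-aggregate for its key. */
    public void collect(K key, T value) {
        table.merge(key, value, reducer); // update the held partial aggregate
    }

    /** Emits all partial aggregates; a real combiner also emits on memory pressure. */
    public Map<K, T> emit() {
        return table;
    }

    public static void main(String[] args) {
        HashCombinerSketch<String, Integer> c = new HashCombinerSketch<>(Integer::sum);
        for (String w : new String[] {"a", "b", "a", "a"}) {
            c.collect(w, 1); // word-count style pre-aggregation
        }
        System.out.println(c.emit()); // a -> 3, b -> 1
    }
}
```

Because input and output types of reduce() are identical, the table never has to hold more than one value per key, which is what makes the hash strategy attractive compared to sorting the whole input first.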