[GitHub] flink pull request #2071: [FLINK-4018](streaming-connectors) Configurable id...

2016-06-03 Thread tzulitai
GitHub user tzulitai opened a pull request:

https://github.com/apache/flink/pull/2071

[FLINK-4018](streaming-connectors) Configurable idle time between 
getRecords requests to Kinesis shards

Along with this new configuration and the already existing 
`KinesisConfigConstants.CONFIG_SHARD_RECORDS_PER_GET`, users will have more 
control over the desired throughput behaviour of the Kinesis consumer.

The default value for this new configuration is an idle time of 500 milliseconds.
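
For illustration, a minimal sketch of how the two settings could be combined 
on a consumer. Only `CONFIG_SHARD_RECORDS_PER_GET` is taken from this PR; the 
idle-time key below is a placeholder, not the PR's actual constant:

```java
Properties props = new Properties();
// existing knob: how many records each getRecords call may fetch
props.setProperty(KinesisConfigConstants.CONFIG_SHARD_RECORDS_PER_GET, "100");
// hypothetical key name for the new idle-time setting (defaults to 500 ms)
props.setProperty("flink.kinesis.shard.getrecords.intervalmillis", "500");

DataStream<String> stream = env.addSource(
    new FlinkKinesisConsumer<>("myStream", new SimpleStringSchema(), props));
```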

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tzulitai/flink FLINK-4018

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2071.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2071


commit 74fcb62f62007c0396d2ae870298f35ae5ce
Author: Gordon Tai 
Date:   2016-06-04T04:43:30Z

[FLINK-4018] Add configuration for idle time between get requests to 
Kinesis shards






[jira] [Commented] (FLINK-4018) Configurable idle time between getRecords requests to Kinesis shards

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315316#comment-15315316
 ] 

ASF GitHub Bot commented on FLINK-4018:
---

GitHub user tzulitai opened a pull request:

https://github.com/apache/flink/pull/2071

[FLINK-4018](streaming-connectors) Configurable idle time between 
getRecords requests to Kinesis shards

Along with this new configuration and the already existing 
`KinesisConfigConstants.CONFIG_SHARD_RECORDS_PER_GET`, users will have more 
control over the desired throughput behaviour of the Kinesis consumer.

The default value for this new configuration is an idle time of 500 milliseconds.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tzulitai/flink FLINK-4018

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2071.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2071


commit 74fcb62f62007c0396d2ae870298f35ae5ce
Author: Gordon Tai 
Date:   2016-06-04T04:43:30Z

[FLINK-4018] Add configuration for idle time between get requests to 
Kinesis shards




> Configurable idle time between getRecords requests to Kinesis shards
> 
>
> Key: FLINK-4018
> URL: https://issues.apache.org/jira/browse/FLINK-4018
> Project: Flink
>  Issue Type: Sub-task
>  Components: Streaming Connectors
>Affects Versions: 1.1.0
>Reporter: Tzu-Li (Gordon) Tai
>Assignee: Tzu-Li (Gordon) Tai
>
> Currently, the Kinesis consumer calls getRecords() immediately after the 
> previous call returns. This makes it easy to hit Amazon's limit of 5 GET 
> requests per shard per second. Although the code already has a backoff & 
> retry mechanism, hitting the limit will affect other applications consuming 
> from the same Kinesis stream.
> Along with this new configuration and the already existing 
> `KinesisConfigConstants.CONFIG_SHARD_RECORDS_PER_GET`, users will have more 
> control over the desired throughput behaviour of the Kinesis consumer.





[jira] [Commented] (FLINK-4018) Configurable idle time between getRecords requests to Kinesis shards

2016-06-03 Thread Tzu-Li (Gordon) Tai (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315290#comment-15315290
 ] 

Tzu-Li (Gordon) Tai commented on FLINK-4018:


We have a user who has tried out the connector and is preparing to go to 
production with Flink & Kinesis next week. The only remaining concern is this 
issue. I'll open a PR for review within the next 24 hours.

> Configurable idle time between getRecords requests to Kinesis shards
> 
>
> Key: FLINK-4018
> URL: https://issues.apache.org/jira/browse/FLINK-4018
> Project: Flink
>  Issue Type: Sub-task
>  Components: Streaming Connectors
>Affects Versions: 1.1.0
>Reporter: Tzu-Li (Gordon) Tai
>Assignee: Tzu-Li (Gordon) Tai
>
> Currently, the Kinesis consumer calls getRecords() immediately after the 
> previous call returns. This makes it easy to hit Amazon's limit of 5 GET 
> requests per shard per second. Although the code already has a backoff & 
> retry mechanism, hitting the limit will affect other applications consuming 
> from the same Kinesis stream.
> Along with this new configuration and the already existing 
> `KinesisConfigConstants.CONFIG_SHARD_RECORDS_PER_GET`, users will have more 
> control over the desired throughput behaviour of the Kinesis consumer.





[jira] [Created] (FLINK-4018) Configurable idle time between getRecords requests to Kinesis shards

2016-06-03 Thread Tzu-Li (Gordon) Tai (JIRA)
Tzu-Li (Gordon) Tai created FLINK-4018:
--

 Summary: Configurable idle time between getRecords requests to 
Kinesis shards
 Key: FLINK-4018
 URL: https://issues.apache.org/jira/browse/FLINK-4018
 Project: Flink
  Issue Type: Sub-task
  Components: Streaming Connectors
Affects Versions: 1.1.0
Reporter: Tzu-Li (Gordon) Tai
Assignee: Tzu-Li (Gordon) Tai


Currently, the Kinesis consumer calls getRecords() immediately after the 
previous call returns. This makes it easy to hit Amazon's limit of 5 GET 
requests per shard per second. Although the code already has a backoff & retry 
mechanism, hitting the limit will affect other applications consuming from the 
same Kinesis stream.

Along with this new configuration and the already existing 
`KinesisConfigConstants.CONFIG_SHARD_RECORDS_PER_GET`, users will have more 
control over the desired throughput behaviour of the Kinesis consumer.





[jira] [Commented] (FLINK-1979) Implement Loss Functions

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315212#comment-15315212
 ] 

ASF GitHub Bot commented on FLINK-1979:
---

Github user skavulya commented on the issue:

https://github.com/apache/flink/pull/1985
  
@chiwanpark Decoupling the gradient descent step is complicated for L1 
regularization because we are using the proximal gradient method, which applies 
soft thresholding after executing the gradient descent step. I left the 
regularization penalty as-is. I am thinking of adding an additional method that 
adds the regularization penalty to the gradient without the gradient descent 
step, but I will do that in the L-BFGS PR instead.


> Implement Loss Functions
> 
>
> Key: FLINK-1979
> URL: https://issues.apache.org/jira/browse/FLINK-1979
> Project: Flink
>  Issue Type: Improvement
>  Components: Machine Learning Library
>Reporter: Johannes Günther
>Assignee: Johannes Günther
>Priority: Minor
>  Labels: ML
>
> For convex optimization problems, optimizer methods like SGD rely on a 
> pluggable implementation of a loss function and its first derivative.





[GitHub] flink issue #1985: [FLINK-1979] Add logistic loss, hinge loss and regulariza...

2016-06-03 Thread skavulya
Github user skavulya commented on the issue:

https://github.com/apache/flink/pull/1985
  
@chiwanpark Decoupling the gradient descent step is complicated for L1 
regularization because we are using the proximal gradient method, which applies 
soft thresholding after executing the gradient descent step. I left the 
regularization penalty as-is. I am thinking of adding an additional method that 
adds the regularization penalty to the gradient without the gradient descent 
step, but I will do that in the L-BFGS PR instead.




[jira] [Commented] (FLINK-3872) Add Kafka TableSource with JSON serialization

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314656#comment-15314656
 ] 

ASF GitHub Bot commented on FLINK-3872:
---

Github user rmetzger commented on the issue:

https://github.com/apache/flink/pull/2069
  
Quickly checked the Maven and Kafka stuff. Looks good. Very clean, well 
documented and tested code.
Further review of the Table API specific parts is still needed.


> Add Kafka TableSource with JSON serialization
> -
>
> Key: FLINK-3872
> URL: https://issues.apache.org/jira/browse/FLINK-3872
> Project: Flink
>  Issue Type: New Feature
>  Components: Table API
>Reporter: Fabian Hueske
>Assignee: Ufuk Celebi
> Fix For: 1.1.0
>
>
> Add a Kafka TableSource which reads JSON serialized data.





[GitHub] flink issue #2069: [FLINK-3872] [table, connector-kafka] Add KafkaJsonTableS...

2016-06-03 Thread rmetzger
Github user rmetzger commented on the issue:

https://github.com/apache/flink/pull/2069
  
Quickly checked the Maven and Kafka stuff. Looks good. Very clean, well 
documented and tested code.
Further review of the Table API specific parts is still needed.




[jira] [Commented] (FLINK-3850) Add forward field annotations to DataSet operators generated by the Table API

2016-06-03 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314630#comment-15314630
 ] 

ramkrishna.s.vasudevan commented on FLINK-3850:
---

[~fhueske]
Would it be possible for me to work on this issue, with your guidance and 
feedback? Let me know what you think.

> Add forward field annotations to DataSet operators generated by the Table API
> -
>
> Key: FLINK-3850
> URL: https://issues.apache.org/jira/browse/FLINK-3850
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API
>Reporter: Fabian Hueske
>
> The DataSet API features semantic annotations [1] that hint to the optimizer 
> which input fields an operator copies unchanged. This information is valuable 
> because the optimizer can infer that certain physical properties such as 
> partitioning or sorting are not destroyed by user functions, and can thus 
> generate more efficient execution plans.
> The Table API is built on top of the DataSet API and generates DataSet 
> programs and code for user-defined functions. Hence, it knows exactly which 
> fields are modified and which are not. We should use this information to 
> automatically generate forward field annotations and attach them to the 
> operators. This can significantly improve the performance of certain jobs.
> [1] 
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/index.html#semantic-annotations
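
For context, this is what a hand-written forward-field annotation looks like 
in the Java DataSet API; the Table API would generate the equivalent 
information automatically (a sketch with made-up tuple types):

{code}
// Declares that field 0 of the input tuple is forwarded unchanged to field 0
// of the output, so partitioning/sorting on f0 survives this map.
@FunctionAnnotation.ForwardedFields("f0")
public static class IncrementSecond
        implements MapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>> {
    @Override
    public Tuple2<Long, Long> map(Tuple2<Long, Long> value) {
        return Tuple2.of(value.f0, value.f1 + 1);
    }
}
{code}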





[GitHub] flink issue #1856: FLINK-3650 Add maxBy/minBy to Scala DataSet API

2016-06-03 Thread ramkrish86
Github user ramkrish86 commented on the issue:

https://github.com/apache/flink/pull/1856
  
Thanks @zentol. I pushed a new commit after wrapping the longer lines.




[jira] [Commented] (FLINK-3650) Add maxBy/minBy to Scala DataSet API

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314611#comment-15314611
 ] 

ASF GitHub Bot commented on FLINK-3650:
---

Github user ramkrish86 commented on the issue:

https://github.com/apache/flink/pull/1856
  
Thanks @zentol. I pushed a new commit after wrapping the longer lines.


> Add maxBy/minBy to Scala DataSet API
> 
>
> Key: FLINK-3650
> URL: https://issues.apache.org/jira/browse/FLINK-3650
> Project: Flink
>  Issue Type: Improvement
>  Components: Java API, Scala API
>Affects Versions: 1.1.0
>Reporter: Till Rohrmann
>Assignee: ramkrishna.s.vasudevan
>
> The stable Java DataSet API contains the API calls {{maxBy}} and {{minBy}}. 
> These methods are not supported by the Scala DataSet API and should be added 
> in order to keep the two APIs consistent.
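
For reference, the Java-side behaviour that the Scala API should mirror (a 
small sketch using the existing Java DataSet API):

{code}
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<Integer, String>> data = env.fromElements(
        Tuple2.of(1, "a"), Tuple2.of(3, "b"), Tuple2.of(2, "c"));

data.maxBy(0).print(); // element with the maximum value in field 0: (3,b)
data.minBy(0).print(); // element with the minimum value in field 0: (1,a)
{code}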





[jira] [Commented] (FLINK-4016) FoldApplyWindowFunction is not properly initialized

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314326#comment-15314326
 ] 

ASF GitHub Bot commented on FLINK-4016:
---

GitHub user rvdwenden opened a pull request:

https://github.com/apache/flink/pull/2070

[FLINK-4016] initialize FoldApplyWindowFunction properly





You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rvdwenden/flink master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2070.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2070


commit ec5eb4890201d6d0c9d91510f3b078a681d5a317
Author: rvdwenden 
Date:   2016-06-03T15:49:03Z

[FLINK-4016] initialize FoldApplyWindowFunction properly




> FoldApplyWindowFunction is not properly initialized
> ---
>
> Key: FLINK-4016
> URL: https://issues.apache.org/jira/browse/FLINK-4016
> Project: Flink
>  Issue Type: Bug
>  Components: DataStream API
>Affects Versions: 1.0.3
>Reporter: RWenden
>Priority: Blocker
>  Labels: easyfix
> Fix For: 1.1.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> FoldApplyWindowFunction's output type is not set.
> We're using constructions like (excerpt):
> {code}
> .keyBy(0)
> .countWindow(10, 5)
> .fold(...)
> {code}
> Running this stream gives a runtime exception in FoldApplyWindowFunction:
> "No initial value was serialized for the fold window function. Probably the 
> setOutputType method was not called."
> This can easily be fixed in WindowedStream.java (around line 449):
> {code}
> FoldApplyWindowFunction foldApplyWindowFunction =
>     new FoldApplyWindowFunction<>(initialValue, foldFunction, function);
> foldApplyWindowFunction.setOutputType(resultType, input.getExecutionConfig());
>
> operator = new EvictingWindowOperator<>(windowAssigner,
>     windowAssigner.getWindowSerializer(getExecutionEnvironment().getConfig()),
>     keySel,
>     input.getKeyType().createSerializer(getExecutionEnvironment().getConfig()),
>     stateDesc,
>     new InternalIterableWindowFunction<>(foldApplyWindowFunction),
>     trigger,
>     evictor);
> {code}





[GitHub] flink pull request #2070: [FLINK-4016] initialize FoldApplyWindowFunction pr...

2016-06-03 Thread rvdwenden
GitHub user rvdwenden opened a pull request:

https://github.com/apache/flink/pull/2070

[FLINK-4016] initialize FoldApplyWindowFunction properly





You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rvdwenden/flink master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2070.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2070


commit ec5eb4890201d6d0c9d91510f3b078a681d5a317
Author: rvdwenden 
Date:   2016-06-03T15:49:03Z

[FLINK-4016] initialize FoldApplyWindowFunction properly






[jira] [Created] (FLINK-4017) [py] Add Aggregation support to Python API

2016-06-03 Thread Geoffrey Mon (JIRA)
Geoffrey Mon created FLINK-4017:
---

 Summary: [py] Add Aggregation support to Python API
 Key: FLINK-4017
 URL: https://issues.apache.org/jira/browse/FLINK-4017
 Project: Flink
  Issue Type: Improvement
  Components: Python API
Reporter: Geoffrey Mon
Priority: Minor


Aggregations are not currently supported in the Python API.

I was getting started with setting up and working with Flink, and figured this 
would be a relatively simple first task. I am currently working on this at 
https://github.com/geofbot/flink





[jira] [Commented] (FLINK-3872) Add Kafka TableSource with JSON serialization

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314238#comment-15314238
 ] 

ASF GitHub Bot commented on FLINK-3872:
---

GitHub user uce opened a pull request:

https://github.com/apache/flink/pull/2069

[FLINK-3872] [table, connector-kafka] Add KafkaJsonTableSource

Adds `StreamTableSource` variants for Kafka with syntactic sugar for 
parsing JSON streams.

```java
KafkaJsonTableSource source = new Kafka08JsonTableSource(
topic,
props,
new String[] { "id" }, // field names
new Class[] { Long.class }); // field types

tableEnvironment.registerTableSource("kafka-stream", source);
```

You can then continue to work with the stream:

```java
Table result = tableEnvironment.ingest("kafka-stream").filter("id > 1000");
tableEnvironment.toDataStream(result, Row.class).print();
```

**Limitations**
- Assumes flat JSON field access (we can easily extend this to use JSON 
pointers, allowing us to parse nested fields like `/location/area` as field 
names).
- Does not extract any timestamps or watermarks (not an issue right now, as 
the Table API currently does not support operations where these are needed).
- The API is kind of cumbersome and non-Scalaesque for the Scala Table API.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uce/flink 3872-kafkajson_table

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2069.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2069


commit 12ec6a594d23bd36bed1e07eeaba2aa75a768f67
Author: Ufuk Celebi 
Date:   2016-06-02T20:38:23Z

[FLINK-3872] [table, connector-kafka] Add JsonRowDeserializationSchema

- Adds a deserialization schema from byte[] to Row to be used in conjunction
  with the Table API.

commit a8dc3aa7ab70a91b12af2adccbbed821bf25ecc9
Author: Ufuk Celebi 
Date:   2016-06-03T13:24:22Z

[FLINK-3872] [table, connector-kafka] Add KafkaTableSource




> Add Kafka TableSource with JSON serialization
> -
>
> Key: FLINK-3872
> URL: https://issues.apache.org/jira/browse/FLINK-3872
> Project: Flink
>  Issue Type: New Feature
>  Components: Table API
>Reporter: Fabian Hueske
>Assignee: Ufuk Celebi
> Fix For: 1.1.0
>
>
> Add a Kafka TableSource which reads JSON serialized data.





[GitHub] flink pull request #2069: [FLINK-3872] [table, connector-kafka] Add KafkaJso...

2016-06-03 Thread uce
GitHub user uce opened a pull request:

https://github.com/apache/flink/pull/2069

[FLINK-3872] [table, connector-kafka] Add KafkaJsonTableSource

Adds `StreamTableSource` variants for Kafka with syntactic sugar for 
parsing JSON streams.

```java
KafkaJsonTableSource source = new Kafka08JsonTableSource(
topic,
props,
new String[] { "id" }, // field names
new Class[] { Long.class }); // field types

tableEnvironment.registerTableSource("kafka-stream", source);
```

You can then continue to work with the stream:

```java
Table result = tableEnvironment.ingest("kafka-stream").filter("id > 1000");
tableEnvironment.toDataStream(result, Row.class).print();
```

**Limitations**
- Assumes flat JSON field access (we can easily extend this to use JSON 
pointers, allowing us to parse nested fields like `/location/area` as field 
names).
- Does not extract any timestamps or watermarks (not an issue right now, as 
the Table API currently does not support operations where these are needed).
- The API is kind of cumbersome and non-Scalaesque for the Scala Table API.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uce/flink 3872-kafkajson_table

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2069.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2069


commit 12ec6a594d23bd36bed1e07eeaba2aa75a768f67
Author: Ufuk Celebi 
Date:   2016-06-02T20:38:23Z

[FLINK-3872] [table, connector-kafka] Add JsonRowDeserializationSchema

- Adds a deserialization schema from byte[] to Row to be used in conjunction
  with the Table API.

commit a8dc3aa7ab70a91b12af2adccbbed821bf25ecc9
Author: Ufuk Celebi 
Date:   2016-06-03T13:24:22Z

[FLINK-3872] [table, connector-kafka] Add KafkaTableSource






[jira] [Created] (FLINK-4016) FoldApplyWindowFunction is not properly initialized

2016-06-03 Thread RWenden (JIRA)
RWenden created FLINK-4016:
--

 Summary: FoldApplyWindowFunction is not properly initialized
 Key: FLINK-4016
 URL: https://issues.apache.org/jira/browse/FLINK-4016
 Project: Flink
  Issue Type: Bug
  Components: DataStream API
Affects Versions: 1.0.3
Reporter: RWenden
Priority: Blocker
 Fix For: 1.1.0


FoldApplyWindowFunction's output type is not set.

We're using constructions like (excerpt):
{code}
.keyBy(0)
.countWindow(10, 5)
.fold(...)
{code}
Running this stream gives a runtime exception in FoldApplyWindowFunction:
"No initial value was serialized for the fold window function. Probably the 
setOutputType method was not called."

This can easily be fixed in WindowedStream.java (around line 449):
{code}
FoldApplyWindowFunction foldApplyWindowFunction =
    new FoldApplyWindowFunction<>(initialValue, foldFunction, function);
foldApplyWindowFunction.setOutputType(resultType, input.getExecutionConfig());

operator = new EvictingWindowOperator<>(windowAssigner,
    windowAssigner.getWindowSerializer(getExecutionEnvironment().getConfig()),
    keySel,
    input.getKeyType().createSerializer(getExecutionEnvironment().getConfig()),
    stateDesc,
    new InternalIterableWindowFunction<>(foldApplyWindowFunction),
    trigger,
    evictor);
{code}







[GitHub] flink issue #2002: Support for bz2 compression in flink-core

2016-06-03 Thread greghogan
Github user greghogan commented on the issue:

https://github.com/apache/flink/pull/2002
  
This looks very handy. We should also update the formats table in the 
documentation.


https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#read-compressed-files




[GitHub] flink pull request #2066: Updated ssh configuration in base Dockerfile

2016-06-03 Thread greghogan
Github user greghogan commented on a diff in the pull request:

https://github.com/apache/flink/pull/2066#discussion_r65717466
  
--- Diff: flink-contrib/docker-flink/base/Dockerfile ---
@@ -38,12 +38,12 @@ ENV JAVA_HOME /usr/java/default/
 RUN echo 'root:secret' | chpasswd
 
 #SSH as root... probably needs to be revised for security!
-RUN sed -i 's/PermitRootLogin without-password/PermitRootLogin yes/' 
/etc/ssh/sshd_config
--- End diff --

Can this be handled more flexibly using sed regular expressions such that 
any parameter is changed to `yes`?




[jira] [Commented] (FLINK-3921) StringParser not specifying encoding to use

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314172#comment-15314172
 ] 

ASF GitHub Bot commented on FLINK-3921:
---

Github user greghogan commented on the issue:

https://github.com/apache/flink/pull/2060
  
The only internal usage of `StringParser` is from `GenericCsvInputFormat`. 
Should we make the encoding configurable in `GenericCsvInputFormat` with a 
default of US-ASCII? This could then be overridden in an additional constructor 
of `StringParser`.

Should the same changes be made to `StringValueParser`?

@rekhajoshm @fhueske @StephanEwen :+1: :-1:?


> StringParser not specifying encoding to use
> ---
>
> Key: FLINK-3921
> URL: https://issues.apache.org/jira/browse/FLINK-3921
> Project: Flink
>  Issue Type: Improvement
>  Components: Core
>Affects Versions: 1.0.3
>Reporter: Tatu Saloranta
>Assignee: Rekha Joshi
>Priority: Trivial
>
> Class `flink.types.parser.StringParser` has javadocs indicating that contents 
> are expected to be ASCII, similar to `StringValueParser`. That makes sense, 
> but when constructing the actual instance, no encoding is specified; for 
> example, on line 66:
>    this.result = new String(bytes, startPos+1, i - startPos - 2);
> which leads to using whatever the default platform encoding is. If contents 
> really are always ASCII (I would not count on that, as the parser is used 
> from the CSV reader), it is not a big deal, but it can lead to the usual 
> Latin-1-vs-UTF-8 issues.
> So I think the encoding should be explicitly specified, whatever it is to be: 
> the javadocs claim ASCII, so it could be "US-ASCII", but it could as well be 
> UTF-8 or even ISO-8859-1.
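
For illustration, a minimal sketch of the explicit-charset variant of that 
line; pinning US-ASCII here merely mirrors what the javadocs claim, and the 
actual default would be whatever the PR decides:

{code}
// Pin the charset explicitly instead of relying on the platform default.
// StandardCharsets.US_ASCII is an assumption taken from the javadocs' claim.
this.result = new String(
        bytes, startPos + 1, i - startPos - 2, StandardCharsets.US_ASCII);
{code}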





[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

2016-06-03 Thread greghogan
Github user greghogan commented on the issue:

https://github.com/apache/flink/pull/2060
  
The only internal usage of `StringParser` is from `GenericCsvInputFormat`. 
Should we make the encoding configurable in `GenericCsvInputFormat` with a 
default of US-ASCII? This could then be overridden in an additional constructor 
of `StringParser`.

Should the same changes be made to `StringValueParser`?

@rekhajoshm @fhueske @StephanEwen :+1: :-1:?




[GitHub] flink pull request #2068: [hotfix] [core] Fix scope format keys

2016-06-03 Thread zentol
GitHub user zentol opened a pull request:

https://github.com/apache/flink/pull/2068

[hotfix] [core] Fix scope format keys

Fixes the scope format keys that were broken in 
7ad8375a89374bec80571029e9166f1336bdea8e.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zentol/flink metrics-hotfix-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2068.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2068


commit 57dda98c39a73720ceedfa2eb8564554cd7c245f
Author: zentol 
Date:   2016-06-03T14:07:21Z

[hotfix] [core] Fix scope format keys






[jira] [Commented] (FLINK-4013) GraphAlgorithms to simplify directed and undirected graphs

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314136#comment-15314136
 ] 

ASF GitHub Bot commented on FLINK-4013:
---

GitHub user greghogan opened a pull request:

https://github.com/apache/flink/pull/2067

[FLINK-4013] [gelly] GraphAlgorithms to simplify directed and undirected 
graphs



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/greghogan/flink 
4013_graphalgorithms_to_simplify_directed_and_undirected_graphs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2067.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2067


commit c27a21c2d9e4dcdcba2bb4689021a7e25e51d494
Author: Greg Hogan 
Date:   2016-06-02T20:01:00Z

[FLINK-4013] [gelly] GraphAlgorithms to simplify directed and undirected 
graphs




> GraphAlgorithms to simplify directed and undirected graphs
> --
>
> Key: FLINK-4013
> URL: https://issues.apache.org/jira/browse/FLINK-4013
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
>Priority: Minor
> Fix For: 1.1.0
>
>
> Create a directed {{GraphAlgorithm}} to remove self-loops and duplicate 
> edges, and an undirected {{GraphAlgorithm}} to symmetrize the graph and 
> remove self-loops and duplicate edges.
> Remove {{RMatGraph.setSimpleGraph}} and the associated logic.
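
A sketch of how such an algorithm would plug into Gelly's existing 
{{Graph#run}} hook; the {{Simplify}} class name below is a placeholder for 
whatever this PR introduces:

{code}
// Hypothetical usage: simplify a directed graph by dropping self-loops
// and duplicate edges before running further analytics.
Graph<LongValue, NullValue, NullValue> simplified =
    graph.run(new Simplify<LongValue, NullValue, NullValue>());
{code}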





[GitHub] flink pull request #2067: [FLINK-4013] [gelly] GraphAlgorithms to simplify d...

2016-06-03 Thread greghogan
GitHub user greghogan opened a pull request:

https://github.com/apache/flink/pull/2067

[FLINK-4013] [gelly] GraphAlgorithms to simplify directed and undirected 
graphs



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/greghogan/flink 
4013_graphalgorithms_to_simplify_directed_and_undirected_graphs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2067.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2067


commit c27a21c2d9e4dcdcba2bb4689021a7e25e51d494
Author: Greg Hogan 
Date:   2016-06-02T20:01:00Z

[FLINK-4013] [gelly] GraphAlgorithms to simplify directed and undirected 
graphs






[GitHub] flink issue #2054: [FLINK-3763] RabbitMQ Source/Sink standardize connection ...

2016-06-03 Thread subhankarb
Github user subhankarb commented on the issue:

https://github.com/apache/flink/pull/2054
  
Hi @rmetzger,
would you please review the changes?





[jira] [Commented] (FLINK-3763) RabbitMQ Source/Sink standardize connection parameters

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314095#comment-15314095
 ] 

ASF GitHub Bot commented on FLINK-3763:
---

Github user subhankarb commented on the issue:

https://github.com/apache/flink/pull/2054
  
Hi @rmetzger,
would you please review the changes?



> RabbitMQ Source/Sink standardize connection parameters
> --
>
> Key: FLINK-3763
> URL: https://issues.apache.org/jira/browse/FLINK-3763
> Project: Flink
>  Issue Type: Improvement
>  Components: Streaming Connectors
>Affects Versions: 1.0.1
>Reporter: Robert Batts
>Assignee: Subhankar Biswas
>
> The RabbitMQ source and sink should have the same capabilities in terms of 
> establishing a connection, currently the sink is lacking connection 
> parameters that are available on the source. Additionally, VirtualHost should 
> be an offered parameter for multi-tenant RabbitMQ clusters (if not specified 
> it goes to the vhost '/').
> Connection Parameters
> ===
> - Host - Offered on both
> - Port - Source only
> - Virtual Host - Neither
> - User - Source only
> - Password - Source only
> Additionally, it might be worth offering the URI as a valid constructor 
> argument, because that would cover all five of the above parameters in a 
> single String.
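
For reference, the underlying RabbitMQ Java client already supports both 
styles, which is roughly what a standardized source/sink configuration would 
delegate to (a sketch of the client API, not the connector's):

{code}
ConnectionFactory factory = new ConnectionFactory();
// individual parameters:
factory.setHost("rabbitmq.example.com");
factory.setPort(5672);
factory.setVirtualHost("tenant-a"); // defaults to "/" when not specified
factory.setUsername("user");
factory.setPassword("secret");
// ...or all five at once via a URI (throws checked exceptions on a bad URI):
factory.setUri("amqp://user:secret@rabbitmq.example.com:5672/tenant-a");
{code}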





[jira] [Commented] (FLINK-3311) Add a connector for streaming data into Cassandra

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314074#comment-15314074
 ] 

ASF GitHub Bot commented on FLINK-3311:
---

Github user rmetzger commented on the issue:

https://github.com/apache/flink/pull/1771
  
I've reviewed the connector again.
The issues I've seen previously (failure on restart) are resolved.
However, I found new issues:
- The Cassandra sink doesn't fail (at least not within 15 minutes) if 
Cassandra is no longer available. It's probably just a configuration setting 
of the Cassandra driver to fail after a certain amount of time.
- We should probably introduce a (configurable) limit (number of records / 
some GBs) for the write-ahead log. It seemed to me that, due to the failed 
other instance, no checkpoints were able to complete anymore (because some of 
the Cassandra sinks were stuck in notifyCheckpointComplete()), while others 
were accepting data into the WAL. This led to a lot of data being written into 
the state backend. I think the Cassandra sink should stop at some point in 
such a situation.

Also, I would like to test the exactly-once behavior on a cluster more 
thoroughly. Currently, I've only tested whether the connector properly fails 
and restores, but I didn't test whether the written data is actually correct.

However, since the code seems to work under normal operation, I would suggest 
merging the connector now and then filing follow-up JIRAs for the remaining 
issues.
This makes collaboration and reviews easier and allows our users to help test 
the Cassandra connector.



Some log:
```
2016-06-03 12:28:36,478 ERROR org.apache.flink.streaming.runtime.operators.GenericWriteAheadSink  - Error while sending value.
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_ONE (1 required but only 0 alive)
    at com.datastax.driver.core.exceptions.UnavailableException.copy(UnavailableException.java:128)
    at com.datastax.driver.core.Responses$Error.asException(Responses.java:114)
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:477)
    at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1005)
    at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:928)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:618)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:329)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:250)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_ONE (1 required but only 0 alive)
    at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:50)
    at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:37)
    at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:266)
    at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:246)
    at

[jira] [Updated] (FLINK-1730) Add a FlinkTools.persist style method to the Data Set.

2016-06-03 Thread Gabor Gevay (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Gevay updated FLINK-1730:
---
Priority: Major  (was: Minor)

> Add a FlinkTools.persist style method to the Data Set.
> --
>
> Key: FLINK-1730
> URL: https://issues.apache.org/jira/browse/FLINK-1730
> Project: Flink
>  Issue Type: New Feature
>Reporter: Stephan Ewen
>
> I think this is an operation that will be needed more prominently: defining 
> a point where one long logical program is broken into separate executions.





[jira] [Commented] (FLINK-4015) FlinkKafkaProducer08 fails when partition leader changes

2016-06-03 Thread Sebastian Klemke (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314042#comment-15314042
 ] 

Sebastian Klemke commented on FLINK-4015:
-

We use the default retries: 0. Also, we can't set retries to a higher value, 
because the Kafka documentation says "Allowing retries will potentially change 
the ordering of records because if two records are sent to a single partition, 
and the first fails and is retried but the second succeeds, then the second 
record may appear first." For our application, this must not happen. Repeating 
whole batches in order would be okay.

> FlinkKafkaProducer08 fails when partition leader changes
> 
>
> Key: FLINK-4015
> URL: https://issues.apache.org/jira/browse/FLINK-4015
> Project: Flink
>  Issue Type: Bug
>  Components: Kafka Connector
>Affects Versions: 1.0.2
>Reporter: Sebastian Klemke
>
> When leader for a partition changes, producer fails with the following 
> exception:
> {code}
> 06:34:50,813 INFO  org.apache.flink.yarn.YarnJobManager  - Status of job b323f5de3d32504651e861d5ecb27e7c (JOB_NAME) changed to FAILING.
> java.lang.RuntimeException: Could not forward element to next operator
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:354)
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:337)
>   at org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
>   at OPERATOR.flatMap2(OPERATOR.java:82)
>   at OPERATOR.flatMap2(OPERATOR.java:16)
>   at org.apache.flink.streaming.api.operators.co.CoStreamFlatMap.processElement2(CoStreamFlatMap.java:63)
>   at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.processInput(StreamTwoInputProcessor.java:207)
>   at org.apache.flink.streaming.runtime.tasks.TwoInputStreamTask.run(TwoInputStreamTask.java:89)
>   at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:225)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Could not forward element to next operator
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:354)
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:337)
>   at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:39)
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:351)
>   ... 10 more
> Caused by: java.lang.RuntimeException: Could not forward element to next operator
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:354)
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:337)
>   at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:39)
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:351)
>   ... 13 more
> Caused by: java.lang.Exception: Failed to send data to Kafka: This server is not the leader for that topic-partition.
>   at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.checkErroneous(FlinkKafkaProducerBase.java:282)
>   at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.invoke(FlinkKafkaProducerBase.java:249)
>   at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:39)
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:351)
>   ... 16 more
> Caused by: org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
> {code}





[jira] [Commented] (FLINK-4002) [py] Improve testing infraestructure

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314040#comment-15314040
 ] 

ASF GitHub Bot commented on FLINK-4002:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/2063
  
You're right, that falls through. We should add additional checks:
- Verify: the index is equal to the size of the expected values
- Verify2: expected is empty


> [py] Improve testing infraestructure
> 
>
> Key: FLINK-4002
> URL: https://issues.apache.org/jira/browse/FLINK-4002
> Project: Flink
>  Issue Type: Bug
>  Components: Python API
>Affects Versions: 1.0.3
>Reporter: Omar Alvarez
>Priority: Minor
>  Labels: Python, Testing
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The Verify() test function errors out when array elements are missing:
> {code}
> env.generate_sequence(1, 5)\
>  .map(Id()).map_partition(Verify([1,2,3,4], "Sequence")).output()
> {code}
> {quote}
> IndexError: list index out of range
> {quote}
> There should also be more documentation in test functions.
> I am already working on a pull request to fix this.





[GitHub] flink issue #2063: [FLINK-4002] [py] Improve testing infraestructure

2016-06-03 Thread zentol
Github user zentol commented on the issue:

https://github.com/apache/flink/pull/2063
  
You're right, that falls through. We should add additional checks:
- Verify: the index is equal to the size of the expected values
- Verify2: expected is empty




[GitHub] flink issue #2063: [FLINK-4002] [py] Improve testing infraestructure

2016-06-03 Thread omaralvarez
Github user omaralvarez commented on the issue:

https://github.com/apache/flink/pull/2063
  
Sorry, I said it wrong, it's the opposite. The case that fails in Verify() 
and Verify2() is when we have more values in expected than in the resulting 
DataSet.




[jira] [Commented] (FLINK-4002) [py] Improve testing infraestructure

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314030#comment-15314030
 ] 

ASF GitHub Bot commented on FLINK-4002:
---

Github user omaralvarez commented on the issue:

https://github.com/apache/flink/pull/2063
  
Sorry, I said it wrong, it's the opposite. The case that fails in Verify() 
and Verify2() is when we have more values in expected than in the resulting 
DataSet.


> [py] Improve testing infraestructure
> 
>
> Key: FLINK-4002
> URL: https://issues.apache.org/jira/browse/FLINK-4002
> Project: Flink
>  Issue Type: Bug
>  Components: Python API
>Affects Versions: 1.0.3
>Reporter: Omar Alvarez
>Priority: Minor
>  Labels: Python, Testing
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The Verify() test function errors out when array elements are missing:
> {code}
> env.generate_sequence(1, 5)\
>  .map(Id()).map_partition(Verify([1,2,3,4], "Sequence")).output()
> {code}
> {quote}
> IndexError: list index out of range
> {quote}
> There should also be more documentation in test functions.
> I am already working on a pull request to fix this.





[GitHub] flink issue #2063: [FLINK-4002] [py] Improve testing infraestructure

2016-06-03 Thread zentol
Github user zentol commented on the issue:

https://github.com/apache/flink/pull/2063
  
When we have more data than expected, remove() will be called on an empty 
list and should throw an exception, no?

If you want to execute only the Python tests, you just have to call mvn verify 
on the flink-python module:
```
cd flink-libraries/flink-python
mvn verify
```




[jira] [Commented] (FLINK-4002) [py] Improve testing infraestructure

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314028#comment-15314028
 ] 

ASF GitHub Bot commented on FLINK-4002:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/2063
  
When we have more data than expected, remove() will be called on an empty 
list and should throw an exception, no?

If you want to execute only the Python tests, you just have to call mvn verify 
on the flink-python module:
```
cd flink-libraries/flink-python
mvn verify
```


> [py] Improve testing infraestructure
> 
>
> Key: FLINK-4002
> URL: https://issues.apache.org/jira/browse/FLINK-4002
> Project: Flink
>  Issue Type: Bug
>  Components: Python API
>Affects Versions: 1.0.3
>Reporter: Omar Alvarez
>Priority: Minor
>  Labels: Python, Testing
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The Verify() test function errors out when array elements are missing:
> {code}
> env.generate_sequence(1, 5)\
>  .map(Id()).map_partition(Verify([1,2,3,4], "Sequence")).output()
> {code}
> {quote}
> IndexError: list index out of range
> {quote}
> There should also be more documentation in test functions.
> I am already working on a pull request to fix this.





[jira] [Commented] (FLINK-4015) FlinkKafkaProducer08 fails when partition leader changes

2016-06-03 Thread Robert Metzger (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314026#comment-15314026
 ] 

Robert Metzger commented on FLINK-4015:
---

Hi Sebastian,
have you set the number of retries to a value higher than 0?
By default, it's set to 0, so the producer will not retry in case such an 
error happens.
I would recommend allowing more retries. The reason we don't set the number 
to something higher by default is that retries can cause duplicate records in 
Kafka.
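
A minimal sketch of what raising the retry count looks like on the producer 
properties ("retries" is the standard Kafka producer config key; the broker 
address and serialization schema here are placeholders):

{code}
Properties props = new Properties();
props.setProperty("bootstrap.servers", "broker1:9092");
// default is 0: fail immediately on errors such as a leader change
props.setProperty("retries", "3");

FlinkKafkaProducer08<String> producer =
    new FlinkKafkaProducer08<>("myTopic", new SimpleStringSchema(), props);
{code}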

> FlinkKafkaProducer08 fails when partition leader changes
> 
>
> Key: FLINK-4015
> URL: https://issues.apache.org/jira/browse/FLINK-4015
> Project: Flink
>  Issue Type: Bug
>  Components: Kafka Connector
>Affects Versions: 1.0.2
>Reporter: Sebastian Klemke
>
> When leader for a partition changes, producer fails with the following 
> exception:
> {code}
> 06:34:50,813 INFO  org.apache.flink.yarn.YarnJobManager  - Status of job b323f5de3d32504651e861d5ecb27e7c (JOB_NAME) changed to FAILING.
> java.lang.RuntimeException: Could not forward element to next operator
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:354)
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:337)
>   at org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
>   at OPERATOR.flatMap2(OPERATOR.java:82)
>   at OPERATOR.flatMap2(OPERATOR.java:16)
>   at org.apache.flink.streaming.api.operators.co.CoStreamFlatMap.processElement2(CoStreamFlatMap.java:63)
>   at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.processInput(StreamTwoInputProcessor.java:207)
>   at org.apache.flink.streaming.runtime.tasks.TwoInputStreamTask.run(TwoInputStreamTask.java:89)
>   at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:225)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Could not forward element to next operator
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:354)
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:337)
>   at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:39)
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:351)
>   ... 10 more
> Caused by: java.lang.RuntimeException: Could not forward element to next operator
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:354)
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:337)
>   at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:39)
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:351)
>   ... 13 more
> Caused by: java.lang.Exception: Failed to send data to Kafka: This server is not the leader for that topic-partition.
>   at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.checkErroneous(FlinkKafkaProducerBase.java:282)
>   at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.invoke(FlinkKafkaProducerBase.java:249)
>   at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:39)
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:351)
>   ... 16 more
> Caused by: org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
> {code}





[GitHub] flink issue #2063: [FLINK-4002] [py] Improve testing infraestructure

2016-06-03 Thread omaralvarez
Github user omaralvarez commented on the issue:

https://github.com/apache/flink/pull/2063
  
I have corrected Verify2(), but there is another case that is not checked: 
when the resulting DataSets have more elements than expected, the error will 
currently go unnoticed.

I also wanted to ask: is there a way to execute only the Python tests? I want 
to unify the utilities in a single file, but without knowing the execution 
path, I cannot make sure the module will be imported correctly.




[jira] [Commented] (FLINK-4002) [py] Improve testing infraestructure

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314023#comment-15314023
 ] 

ASF GitHub Bot commented on FLINK-4002:
---

Github user omaralvarez commented on the issue:

https://github.com/apache/flink/pull/2063
  
I have corrected Verify2(), but there is another case that is not checked: 
when the resulting DataSets have more elements than expected, the error will 
currently go unnoticed.

I also wanted to ask: is there a way to execute only the Python tests? I want 
to unify the utilities in a single file, but without knowing the execution 
path, I cannot make sure the module will be imported correctly.


> [py] Improve testing infraestructure
> 
>
> Key: FLINK-4002
> URL: https://issues.apache.org/jira/browse/FLINK-4002
> Project: Flink
>  Issue Type: Bug
>  Components: Python API
>Affects Versions: 1.0.3
>Reporter: Omar Alvarez
>Priority: Minor
>  Labels: Python, Testing
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The Verify() test function errors out when array elements are missing:
> {code}
> env.generate_sequence(1, 5)\
>  .map(Id()).map_partition(Verify([1,2,3,4], "Sequence")).output()
> {code}
> {quote}
> IndexError: list index out of range
> {quote}
> There should also be more documentation in test functions.
> I am already working on a pull request to fix this.
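A minimal sketch of the check being discussed (shown in Scala rather than the 
Python test utilities, purely to illustrate the idea; names are hypothetical):

{code}
// Compare sizes in both directions so that extra elements also fail the
// verification, instead of only indexing expected(i) and raising an error
// for missing ones.
def verify[T](expected: Seq[T], actual: Seq[T], name: String): Unit = {
  require(actual.size == expected.size,
    s"$name: expected ${expected.size} elements, got ${actual.size}")
  for (((e, a), i) <- expected.zip(actual).zipWithIndex) {
    require(e == a, s"$name: mismatch at index $i: expected $e, got $a")
  }
}
{code}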



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4015) FlinkKafkaProducer08 fails when partition leader changes

2016-06-03 Thread Sebastian Klemke (JIRA)
Sebastian Klemke created FLINK-4015:
---

 Summary: FlinkKafkaProducer08 fails when partition leader changes
 Key: FLINK-4015
 URL: https://issues.apache.org/jira/browse/FLINK-4015
 Project: Flink
  Issue Type: Bug
  Components: Kafka Connector
Affects Versions: 1.0.2
Reporter: Sebastian Klemke


When leader for a partition changes, producer fails with the following 
exception:
{code}
06:34:50,813 INFO  org.apache.flink.yarn.YarnJobManager 
 - Status of job b323f5de3d32504651e861d5ecb27e7c (JOB_NAME) changed to FAILING.
java.lang.RuntimeException: Could not forward element to next operator
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:354)
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:337)
at 
org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
at OPERATOR.flatMap2(OPERATOR.java:82)
at OPERATOR.flatMap2(OPERATOR.java:16)
at 
org.apache.flink.streaming.api.operators.co.CoStreamFlatMap.processElement2(CoStreamFlatMap.java:63)
at 
org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.processInput(StreamTwoInputProcessor.java:207)
at 
org.apache.flink.streaming.runtime.tasks.TwoInputStreamTask.run(TwoInputStreamTask.java:89)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:225)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Could not forward element to next 
operator
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:354)
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:337)
at 
org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:39)
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:351)
... 10 more
Caused by: java.lang.RuntimeException: Could not forward element to next 
operator
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:354)
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:337)
at 
org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:39)
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:351)
... 13 more
Caused by: java.lang.Exception: Failed to send data to Kafka: This server is 
not the leader for that topic-partition.
at 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.checkErroneous(FlinkKafkaProducerBase.java:282)
at 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.invoke(FlinkKafkaProducerBase.java:249)
at 
org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:39)
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:351)
... 16 more
Caused by: org.apache.kafka.common.errors.NotLeaderForPartitionException: This 
server is not the leader for that topic-partition.
{code}
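A hedged sketch of a possible mitigation (assuming the sink uses the new Kafka 
producer client, as the {{org.apache.kafka.common.errors}} package in the trace 
suggests; the broker address, topic, and retry values are placeholders): let the 
client retry transient errors such as a leader change instead of surfacing them 
to the sink.

{code}
import java.util.Properties
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer08
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

val props = new Properties()
props.setProperty("bootstrap.servers", "broker1:9092") // placeholder
// Retry transient send failures (e.g. NotLeaderForPartitionException after a
// leader election) a few times before failing the job.
props.setProperty("retries", "3")
props.setProperty("retry.backoff.ms", "1000")

val producer = new FlinkKafkaProducer08[String](
  "OUTPUT_TOPIC", new SimpleStringSchema(), props)
// stream.addSink(producer)
{code}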



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3650) Add maxBy/minBy to Scala DataSet API

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314000#comment-15314000
 ] 

ASF GitHub Bot commented on FLINK-3650:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/1856
  
I did not look at it in detail, just checked whether the builds passed.


> Add maxBy/minBy to Scala DataSet API
> 
>
> Key: FLINK-3650
> URL: https://issues.apache.org/jira/browse/FLINK-3650
> Project: Flink
>  Issue Type: Improvement
>  Components: Java API, Scala API
>Affects Versions: 1.1.0
>Reporter: Till Rohrmann
>Assignee: ramkrishna.s.vasudevan
>
> The stable Java DataSet API contains the API calls {{maxBy}} and {{minBy}}. 
> These methods are not supported by the Scala DataSet API. These methods 
> should be added in order to have a consistent API.
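A sketch of how the requested Scala API could look in use, assuming it mirrors 
the Java semantics (position-based field selection on tuples); this is 
illustrative, not the merged API:

{code}
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val data = env.fromElements((3, "a"), (1, "b"), (2, "c"))

val max = data.maxBy(0) // tuple with the largest first field: (3, "a")
val min = data.minBy(0) // tuple with the smallest first field: (1, "b")

max.print()
min.print()
{code}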



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink issue #1856: FLINK-3650 Add maxBy/minBy to Scala DataSet API

2016-06-03 Thread zentol
Github user zentol commented on the issue:

https://github.com/apache/flink/pull/1856
  
I did not look at it in detail, just checked whether the builds passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-3650) Add maxBy/minBy to Scala DataSet API

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313995#comment-15313995
 ] 

ASF GitHub Bot commented on FLINK-3650:
---

Github user ramkrish86 commented on the issue:

https://github.com/apache/flink/pull/1856
  
Thanks for the review, @zentol. OK, I will correct the line length. Does 
the overall approach look good?


> Add maxBy/minBy to Scala DataSet API
> 
>
> Key: FLINK-3650
> URL: https://issues.apache.org/jira/browse/FLINK-3650
> Project: Flink
>  Issue Type: Improvement
>  Components: Java API, Scala API
>Affects Versions: 1.1.0
>Reporter: Till Rohrmann
>Assignee: ramkrishna.s.vasudevan
>
> The stable Java DataSet API contains the API calls {{maxBy}} and {{minBy}}. 
> These methods are not supported by the Scala DataSet API. These methods 
> should be added in order to have a consistent API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink issue #1856: FLINK-3650 Add maxBy/minBy to Scala DataSet API

2016-06-03 Thread ramkrish86
Github user ramkrish86 commented on the issue:

https://github.com/apache/flink/pull/1856
  
Thanks for the review, @zentol. OK, I will correct the line length. Does 
the overall approach look good?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-3650) Add maxBy/minBy to Scala DataSet API

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313990#comment-15313990
 ] 

ASF GitHub Bot commented on FLINK-3650:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/1856
  
The build is failing due to 54 Scala style violations (line length).


> Add maxBy/minBy to Scala DataSet API
> 
>
> Key: FLINK-3650
> URL: https://issues.apache.org/jira/browse/FLINK-3650
> Project: Flink
>  Issue Type: Improvement
>  Components: Java API, Scala API
>Affects Versions: 1.1.0
>Reporter: Till Rohrmann
>Assignee: ramkrishna.s.vasudevan
>
> The stable Java DataSet API contains the API calls {{maxBy}} and {{minBy}}. 
> These methods are not supported by the Scala DataSet API. These methods 
> should be added in order to have a consistent API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink issue #1856: FLINK-3650 Add maxBy/minBy to Scala DataSet API

2016-06-03 Thread zentol
Github user zentol commented on the issue:

https://github.com/apache/flink/pull/1856
  
The build is failing due to 54 Scala style violations (line length).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-3650) Add maxBy/minBy to Scala DataSet API

2016-06-03 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313977#comment-15313977
 ] 

ramkrishna.s.vasudevan commented on FLINK-3650:
---

[~till.rohrmann] [~fhueske]
Any chance of a review here?

> Add maxBy/minBy to Scala DataSet API
> 
>
> Key: FLINK-3650
> URL: https://issues.apache.org/jira/browse/FLINK-3650
> Project: Flink
>  Issue Type: Improvement
>  Components: Java API, Scala API
>Affects Versions: 1.1.0
>Reporter: Till Rohrmann
>Assignee: ramkrishna.s.vasudevan
>
> The stable Java DataSet API contains the API calls {{maxBy}} and {{minBy}}. 
> These methods are not supported by the Scala DataSet API. These methods 
> should be added in order to have a consistent API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3919) Distributed Linear Algebra: row-based matrix

2016-06-03 Thread Simone Robutti (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313955#comment-15313955
 ] 

Simone Robutti commented on FLINK-3919:
---

Hmm, you're probably right. I checked Breeze, and it uses addition for matrix 
addition and {{sum}} for the element-wise sum. 

> Distributed Linear Algebra: row-based matrix
> 
>
> Key: FLINK-3919
> URL: https://issues.apache.org/jira/browse/FLINK-3919
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Simone Robutti
>Assignee: Simone Robutti
>
> Distributed matrix implementation as a DataSet of IndexedRow and related 
> operations



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3919) Distributed Linear Algebra: row-based matrix

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313952#comment-15313952
 ] 

ASF GitHub Bot commented on FLINK-3919:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/1996#discussion_r65688309
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/math/distributed/DistributedRowMatrix.scala
 ---
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.math.distributed
+
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.distributed.DistributedMatrix._
+import org.apache.flink.ml.math._
+
+/**
+  * Distributed row-major matrix representation.
+  * @param numRows Number of rows.
+  * @param numCols Number of columns.
+  */
+class DistributedRowMatrix(val data: DataSet[IndexedRow],
+   val numRows: Int,
+   val numCols: Int)
+extends DistributedMatrix {
+
+  /**
+* Collects the data in the form of a sequence of coordinates 
associated with their values.
+* @return
+*/
+  def toCOO: Seq[(MatrixRowIndex, MatrixColIndex, Double)] = {
+
+val localRows = data.collect()
+
+for (IndexedRow(rowIndex, vector) <- localRows;
+ (columnIndex, value) <- vector) yield (rowIndex, columnIndex, 
value)
+  }
+
+  /**
+* Collects the data in the form of a SparseMatrix
+* @return
+*/
+  def toLocalSparseMatrix: SparseMatrix = {
+val localMatrix =
+  SparseMatrix.fromCOO(this.numRows, this.numCols, this.toCOO)
+require(localMatrix.numRows == this.numRows)
+require(localMatrix.numCols == this.numCols)
+localMatrix
+  }
+
+  //TODO: convert to dense representation on the distributed matrix and 
collect it afterward
+  def toLocalDenseMatrix: DenseMatrix = 
this.toLocalSparseMatrix.toDenseMatrix
+
+  /**
+* Applies a higher-order function to each couple of rows.
+* @param fun
+* @param other
+* @return
+*/
+  def byRowOperation(fun: (Vector, Vector) => Vector,
+ other: DistributedRowMatrix): DistributedRowMatrix = {
+val otherData = other.data
+require(this.numCols == other.numCols)
+require(this.numRows == other.numRows)
+
+val result = this.data
+  .fullOuterJoin(otherData)
+  .where("rowIndex")
+  .equalTo("rowIndex")(
+  (left: IndexedRow, right: IndexedRow) => {
+val row1 = Option(left) match {
+  case Some(row: IndexedRow) => row
+  case None =>
+IndexedRow(
+right.rowIndex,
+SparseVector.fromCOO(right.values.size, List((0, 0.0))))
+}
+val row2 = Option(right) match {
+  case Some(row: IndexedRow) => row
+  case None =>
+IndexedRow(
+left.rowIndex,
+SparseVector.fromCOO(left.values.size, List((0, 0.0))))
+}
+IndexedRow(row1.rowIndex, fun(row1.values, row2.values))
+  }
+  )
+new DistributedRowMatrix(result, numRows, numCols)
+  }
+
+  /**
+* Add the matrix to another matrix.
+* @param other
+* @return
+*/
+  def sum(other: DistributedRowMatrix): DistributedRowMatrix = {
--- End diff --

Is `add` a more proper name for this method? If you agree, please do not 
update this PR but just notify me, because I'm rebasing this PR on the 
current master.


> Distributed Linear Algebra: row-based matrix
> 
>
> Key: FLINK-3919
> URL: 

[GitHub] flink pull request #1996: [FLINK-3919][flink-ml] Distributed Linear Algebra:...

2016-06-03 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/1996#discussion_r65688309
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/math/distributed/DistributedRowMatrix.scala
 ---
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.math.distributed
+
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.distributed.DistributedMatrix._
+import org.apache.flink.ml.math._
+
+/**
+  * Distributed row-major matrix representation.
+  * @param numRows Number of rows.
+  * @param numCols Number of columns.
+  */
+class DistributedRowMatrix(val data: DataSet[IndexedRow],
+   val numRows: Int,
+   val numCols: Int)
+extends DistributedMatrix {
+
+  /**
+* Collects the data in the form of a sequence of coordinates 
associated with their values.
+* @return
+*/
+  def toCOO: Seq[(MatrixRowIndex, MatrixColIndex, Double)] = {
+
+val localRows = data.collect()
+
+for (IndexedRow(rowIndex, vector) <- localRows;
+ (columnIndex, value) <- vector) yield (rowIndex, columnIndex, 
value)
+  }
+
+  /**
+* Collects the data in the form of a SparseMatrix
+* @return
+*/
+  def toLocalSparseMatrix: SparseMatrix = {
+val localMatrix =
+  SparseMatrix.fromCOO(this.numRows, this.numCols, this.toCOO)
+require(localMatrix.numRows == this.numRows)
+require(localMatrix.numCols == this.numCols)
+localMatrix
+  }
+
+  //TODO: convert to dense representation on the distributed matrix and 
collect it afterward
+  def toLocalDenseMatrix: DenseMatrix = 
this.toLocalSparseMatrix.toDenseMatrix
+
+  /**
+* Applies a higher-order function to each couple of rows.
+* @param fun
+* @param other
+* @return
+*/
+  def byRowOperation(fun: (Vector, Vector) => Vector,
+ other: DistributedRowMatrix): DistributedRowMatrix = {
+val otherData = other.data
+require(this.numCols == other.numCols)
+require(this.numRows == other.numRows)
+
+val result = this.data
+  .fullOuterJoin(otherData)
+  .where("rowIndex")
+  .equalTo("rowIndex")(
+  (left: IndexedRow, right: IndexedRow) => {
+val row1 = Option(left) match {
+  case Some(row: IndexedRow) => row
+  case None =>
+IndexedRow(
+right.rowIndex,
+SparseVector.fromCOO(right.values.size, List((0, 0.0))))
+}
+val row2 = Option(right) match {
+  case Some(row: IndexedRow) => row
+  case None =>
+IndexedRow(
+left.rowIndex,
+SparseVector.fromCOO(left.values.size, List((0, 0.0))))
+}
+IndexedRow(row1.rowIndex, fun(row1.values, row2.values))
+  }
+  )
+new DistributedRowMatrix(result, numRows, numCols)
+  }
+
+  /**
+* Add the matrix to another matrix.
+* @param other
+* @return
+*/
+  def sum(other: DistributedRowMatrix): DistributedRowMatrix = {
--- End diff --

Is `add` a more proper name for this method? If you agree, please do not 
update this PR but just notify me, because I'm rebasing this PR on the 
current master.
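For context, a hypothetical usage sketch of the class under review (class and 
method names are taken from the diff; the import of IndexedRow from the PR's 
package and the data values are assumptions):

{code}
import org.apache.flink.api.scala._
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.math.distributed.{DistributedRowMatrix, IndexedRow}

val env = ExecutionEnvironment.getExecutionEnvironment

val a = new DistributedRowMatrix(env.fromElements(
  IndexedRow(0, DenseVector(1.0, 2.0)),
  IndexedRow(1, DenseVector(3.0, 4.0))), numRows = 2, numCols = 2)

val b = new DistributedRowMatrix(env.fromElements(
  IndexedRow(0, DenseVector(0.5, 0.5)),
  IndexedRow(1, DenseVector(0.5, 0.5))), numRows = 2, numCols = 2)

// Element-wise addition, built on byRowOperation in the diff; this is the
// method the naming discussion (sum vs. add) is about.
val c = a.sum(b)
val local = c.toLocalDenseMatrix
{code}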


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Updated] (FLINK-4014) Jobs hang. Maybe NetworkBufferPool and LocalBufferPool has some problem

2016-06-03 Thread ZhengBowen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhengBowen updated FLINK-4014:
--
Description: 
I ran five jobs and all the slots were filled, and all five jobs appear to be 
hanging.

The following attachments are my jstack output and job source code.

The hang.stack file shows that all threads related to the job are in WAIT state. Why?

  was:
I ran five jobs and all the slots were filled, and all five jobs appear to be 
hanging.

The following attachments are my jstack output and job source code.

The hang.stack file shows that all threads related to the job are in WAIT state.


> Jobs hang. Maybe NetworkBufferPool and LocalBufferPool has some problem
> ---
>
> Key: FLINK-4014
> URL: https://issues.apache.org/jira/browse/FLINK-4014
> Project: Flink
>  Issue Type: Bug
>Reporter: ZhengBowen
> Attachments: FlinkJob_20160603_174748_03.java, hang.stack
>
>
> I ran five jobs and all the slots were filled, and all five jobs appear to be 
> hanging.
> The following attachments are my jstack output and job source code.
> The hang.stack file shows that all threads related to the job are in WAIT state. Why?
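Not a diagnosis, but a common first knob to check when all task threads are 
WAITING on buffer requests is the network buffer budget in flink-conf.yaml (the 
value below is illustrative):

{code}
taskmanager.network.numberOfBuffers: 4096
{code}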



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-4014) Jobs hang. Maybe NetworkBufferPool and LocalBufferPool has some problem

2016-06-03 Thread ZhengBowen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhengBowen updated FLINK-4014:
--
Summary: Jobs hang. Maybe NetworkBufferPool and LocalBufferPool has some 
problem  (was: Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some 
problem)

> Jobs hang. Maybe NetworkBufferPool and LocalBufferPool has some problem
> ---
>
> Key: FLINK-4014
> URL: https://issues.apache.org/jira/browse/FLINK-4014
> Project: Flink
>  Issue Type: Bug
>Reporter: ZhengBowen
> Attachments: FlinkJob_20160603_174748_03.java, hang.stack
>
>
> I ran five jobs and all the slots were filled, and all five jobs appear to be 
> hanging.
> The following attachments are my jstack output and job source code.
> The hang.stack file shows that all threads related to the job are in WAIT state. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-4014) Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem

2016-06-03 Thread ZhengBowen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhengBowen updated FLINK-4014:
--
Description: 
I ran five jobs and all the slots were filled, and all five jobs appear to be 
hanging.

The following attachments are my jstack output and job source code.

The hang.stack file shows that all threads related to the job are in WAIT state. 

  was:
I ran five jobs and all the slots were filled, and all five jobs appear to be 
hanging.

The following attachments are my jstack output and job source code.


> Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem
> ---
>
> Key: FLINK-4014
> URL: https://issues.apache.org/jira/browse/FLINK-4014
> Project: Flink
>  Issue Type: Bug
>Reporter: ZhengBowen
> Attachments: FlinkJob_20160603_174748_03.java, hang.stack
>
>
> I ran five jobs and all the slots were filled, and all five jobs appear to be 
> hanging.
> The following attachments are my jstack output and job source code.
> The hang.stack file shows that all threads related to the job are in WAIT state. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request #2066: Updated ssh configuration in base Dockerfile

2016-06-03 Thread techmaniack
GitHub user techmaniack opened a pull request:

https://github.com/apache/flink/pull/2066

Updated ssh configuration in base Dockerfile

  - The pull request addresses only one issue

Previously, SSH into the container was not allowed, because 'without-password' 
has been replaced with 'prohibit-password' in Xenial.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/techmaniack/flink patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2066.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2066


commit 768723ccf9cf2cab09905063cbf5f7ddf4296494
Author: AbdulKarim Memon 
Date:   2016-06-03T10:20:51Z

Updated ssh configuration in base Dockerfile

Previously, SSH into the container was not allowed, because 'without-password' 
has been replaced with 'prohibit-password' in Xenial.
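A hypothetical reconstruction of the kind of change described (the actual diff 
is not included in this mail; the file path and the original sed pattern are 
assumptions): the Dockerfile's sed that enables root SSH must match the value 
Xenial renamed.

{code}
# Xenial ships 'PermitRootLogin prohibit-password' instead of the older
# 'without-password', so a sed matching the old string no longer takes effect.
RUN sed -i "s/PermitRootLogin prohibit-password/PermitRootLogin yes/" /etc/ssh/sshd_config
{code}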




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-4014) Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem

2016-06-03 Thread ZhengBowen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313944#comment-15313944
 ] 

ZhengBowen commented on FLINK-4014:
---

[~StephanEwen]
can you help me?

> Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem
> ---
>
> Key: FLINK-4014
> URL: https://issues.apache.org/jira/browse/FLINK-4014
> Project: Flink
>  Issue Type: Bug
>Reporter: ZhengBowen
> Attachments: FlinkJob_20160603_174748_03.java, hang.stack
>
>
> I ran five jobs and all the slots were filled, and all five jobs appear to be 
> hanging.
> The following attachments are my jstack output and job source code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-4014) Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem

2016-06-03 Thread ZhengBowen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhengBowen updated FLINK-4014:
--
Attachment: FlinkJob_20160603_174748_03.java

> Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem
> ---
>
> Key: FLINK-4014
> URL: https://issues.apache.org/jira/browse/FLINK-4014
> Project: Flink
>  Issue Type: Bug
>Reporter: ZhengBowen
> Attachments: FlinkJob_20160603_174748_03.java, hang.stack
>
>
> I ran five jobs and all the slots were filled, and all five jobs appear to be 
> hanging.
> The following attachments are my jstack output and job source code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4014) Jobs hang, Maybe NetworkBufferPool and Local、

2016-06-03 Thread ZhengBowen (JIRA)
ZhengBowen created FLINK-4014:
-

 Summary: Jobs hang, Maybe NetworkBufferPool and Local、
 Key: FLINK-4014
 URL: https://issues.apache.org/jira/browse/FLINK-4014
 Project: Flink
  Issue Type: Bug
Reporter: ZhengBowen
 Attachments: hang.stack

I ran five jobs and all the slots were filled, and all five jobs appear to be 
hanging.

The following attachments are my jstack output and job source code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-4014) Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem

2016-06-03 Thread ZhengBowen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhengBowen updated FLINK-4014:
--
Attachment: hang.stack

> Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem
> ---
>
> Key: FLINK-4014
> URL: https://issues.apache.org/jira/browse/FLINK-4014
> Project: Flink
>  Issue Type: Bug
>Reporter: ZhengBowen
> Attachments: hang.stack
>
>
> I ran five jobs and all the slots were filled, and all five jobs appear to be 
> hanging.
> The following attachments are my jstack output and job source code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-4014) Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem

2016-06-03 Thread ZhengBowen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhengBowen updated FLINK-4014:
--
Summary: Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some 
problem  (was: Jobs hang, Maybe NetworkBufferPool and LocalBuffer)

> Jobs hang, Maybe NetworkBufferPool and LocalBufferPool has some problem
> ---
>
> Key: FLINK-4014
> URL: https://issues.apache.org/jira/browse/FLINK-4014
> Project: Flink
>  Issue Type: Bug
>Reporter: ZhengBowen
> Attachments: hang.stack
>
>
> I ran five jobs and all the slots were filled, and all five jobs appear to be 
> hanging.
> The following attachments are my jstack output and job source code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-4014) Jobs hang, Maybe NetworkBufferPool and LocalBuffer

2016-06-03 Thread ZhengBowen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhengBowen updated FLINK-4014:
--
Summary: Jobs hang, Maybe NetworkBufferPool and LocalBuffer  (was: Jobs 
hang, Maybe NetworkBufferPool and Local、)

> Jobs hang, Maybe NetworkBufferPool and LocalBuffer
> --
>
> Key: FLINK-4014
> URL: https://issues.apache.org/jira/browse/FLINK-4014
> Project: Flink
>  Issue Type: Bug
>Reporter: ZhengBowen
> Attachments: hang.stack
>
>
> I ran five jobs and all the slots were filled, and all five jobs appear to be 
> hanging.
> The following attachments are my jstack output and job source code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1873) Distributed matrix implementation

2016-06-03 Thread Simone Robutti (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313938#comment-15313938
 ] 

Simone Robutti commented on FLINK-1873:
---

That would be perfect :)

> Distributed matrix implementation
> -
>
> Key: FLINK-1873
> URL: https://issues.apache.org/jira/browse/FLINK-1873
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: liaoyuxi
>Assignee: Simone Robutti
>  Labels: ML
>
> It would help to implement machine learning algorithms more quickly and 
> concisely if Flink provided support for storing data and performing 
> computation in a distributed matrix. The design of the implementation is attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1873) Distributed matrix implementation

2016-06-03 Thread Chiwan Park (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313915#comment-15313915
 ] 

Chiwan Park commented on FLINK-1873:


I think we don't need to hurry, but I'll review the first PR and merge it in 5 
hours if there are no more problems.

> Distributed matrix implementation
> -
>
> Key: FLINK-1873
> URL: https://issues.apache.org/jira/browse/FLINK-1873
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: liaoyuxi
>Assignee: Simone Robutti
>  Labels: ML
>
> It would help to implement machine learning algorithms more quickly and 
> concisely if Flink provided support for storing data and performing 
> computation in a distributed matrix. The design of the implementation is attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1873) Distributed matrix implementation

2016-06-03 Thread Simone Robutti (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313903#comment-15313903
 ] 

Simone Robutti commented on FLINK-1873:
---

OK. Anyway, tomorrow will be my first day of holiday, and from Wednesday I won't 
have continuous access to the internet for 2 weeks. I hope to get the first PR 
merged before that day so that I can submit the second PR. Otherwise, for 
trivial corrections to the first PR, I will hand over to a colleague of mine 
for the 2 weeks I'm away.

> Distributed matrix implementation
> -
>
> Key: FLINK-1873
> URL: https://issues.apache.org/jira/browse/FLINK-1873
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: liaoyuxi
>Assignee: Simone Robutti
>  Labels: ML
>
> It would help to implement machine learning algorithms more quickly and 
> concisely if Flink provided support for storing data and performing 
> computation in a distributed matrix. The design of the implementation is attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3937) Make flink cli list, savepoint, cancel and stop work on Flink-on-YARN clusters

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313890#comment-15313890
 ] 

ASF GitHub Bot commented on FLINK-3937:
---

Github user mxm commented on the issue:

https://github.com/apache/flink/pull/2034
  
Thanks for the update! 


> Make flink cli list, savepoint, cancel and stop work on Flink-on-YARN clusters
> --
>
> Key: FLINK-3937
> URL: https://issues.apache.org/jira/browse/FLINK-3937
> Project: Flink
>  Issue Type: Improvement
>Reporter: Sebastian Klemke
>Assignee: Maximilian Michels
>Priority: Trivial
> Attachments: improve_flink_cli_yarn_integration.patch
>
>
> Currently, the flink CLI can't figure out the JobManager RPC location for 
> Flink-on-YARN clusters. Therefore, the list, savepoint, cancel and stop 
> subcommands are hard to invoke if you only know the YARN application ID. As 
> an improvement, I suggest adding a -yid option to the 
> mentioned subcommands that can be used together with -m yarn-cluster. The flink 
> CLI would then retrieve the JobManager RPC location from the YARN ResourceManager.
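With the patch, the invocation would look roughly like this (the application ID 
is a made-up placeholder):

{code}
bin/flink list -m yarn-cluster -yid application_1464947322222_0001
{code}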



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink issue #2034: [FLINK-3937] Implemented -yid option to Flink cli list, s...

2016-06-03 Thread mxm
Github user mxm commented on the issue:

https://github.com/apache/flink/pull/2034
  
Thanks for the update! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-3908) FieldParsers error state is not reset correctly to NONE

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313872#comment-15313872
 ] 

ASF GitHub Bot commented on FLINK-3908:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/2007
  
I'm not sure; in any case it should not be removed as part of this PR.

You can open a separate JIRA or ask on the mailing list.


> FieldParsers error state is not reset correctly to NONE
> ---
>
> Key: FLINK-3908
> URL: https://issues.apache.org/jira/browse/FLINK-3908
> Project: Flink
>  Issue Type: Bug
>  Components: Core
>Affects Versions: 1.0.2
>Reporter: Flavio Pompermaier
>Assignee: Flavio Pompermaier
>  Labels: parser
>
> If a parse error occurs while parsing a CSV (for example, when an 
> integer column contains non-int values), the errorState is not reset 
> correctly in the next parseField call. A simple fix would be to add a call to 
> {{setErrorState(ParseErrorState.NONE)}} as the first statement of the 
> {{parseField()}} function, but this is something that should be 
> handled better (by default) for every subclass of {{FieldParser}}.
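A toy sketch (not the actual patch) of the suggested reset, written as a 
minimal {{FieldParser}} subclass; the parser itself is deliberately simplistic 
(a single digit, no delimiter handling) and exists only to show where the 
reset goes:

{code}
import org.apache.flink.types.parser.FieldParser
import org.apache.flink.types.parser.FieldParser.ParseErrorState

class ToyDigitParser extends FieldParser[Integer] {
  private var result: Integer = 0

  override protected def parseField(bytes: Array[Byte], startPos: Int,
      limit: Int, delim: Array[Byte], reuse: Integer): Int = {
    // The fix suggested in the issue: clear any error state left over from a
    // previous record before parsing the next field.
    setErrorState(ParseErrorState.NONE)
    val b = bytes(startPos)
    if (b < '0' || b > '9') {
      setErrorState(ParseErrorState.NUMERIC_VALUE_ILLEGAL_CHARACTER)
      return -1
    }
    result = b - '0'
    startPos + 1 // consume exactly one digit
  }

  override def getLastResult: Integer = result
  override def createValue(): Integer = 0
}
{code}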



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink issue #2007: [FLINK-3908] Fixed Parser's error state reset

2016-06-03 Thread zentol
Github user zentol commented on the issue:

https://github.com/apache/flink/pull/2007
  
I'm not sure; in any case it should not be removed as part of this PR.

You can open a separate JIRA or ask on the mailing list.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink issue #2007: [FLINK-3908] Fixed Parser's error state reset

2016-06-03 Thread fpompermaier
Github user fpompermaier commented on the issue:

https://github.com/apache/flink/pull/2007
  
It should be OK now, following your suggestions. I misunderstood what 
@StephanEwen was trying to say; thanks @zentol for the clarification!
Just one more thing: the method GenericCsvInputFormat.checkAndCoSort() is 
never used in the code. Do you want to keep it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-3908) FieldParsers error state is not reset correctly to NONE

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313864#comment-15313864
 ] 

ASF GitHub Bot commented on FLINK-3908:
---

Github user fpompermaier commented on the issue:

https://github.com/apache/flink/pull/2007
  
It should be OK now, following your suggestions. I misunderstood what 
@StephanEwen was trying to say; thanks @zentol for the clarification!
Just one more thing: the method GenericCsvInputFormat.checkAndCoSort() is 
never used in the code. Do you want to keep it?


> FieldParsers error state is not reset correctly to NONE
> ---
>
> Key: FLINK-3908
> URL: https://issues.apache.org/jira/browse/FLINK-3908
> Project: Flink
>  Issue Type: Bug
>  Components: Core
>Affects Versions: 1.0.2
>Reporter: Flavio Pompermaier
>Assignee: Flavio Pompermaier
>  Labels: parser
>
> If a parse error occurs while parsing a CSV (for example, when an 
> integer column contains non-int values), the errorState is not reset 
> correctly in the next parseField call. A simple fix would be to add a call to 
> {{setErrorState(ParseErrorState.NONE)}} as the first statement of the 
> {{parseField()}} function, but this is something that should be 
> handled better (by default) for every subclass of {{FieldParser}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request #1996: [FLINK-3919][flink-ml] Distributed Linear Algebra:...

2016-06-03 Thread chobeat
Github user chobeat commented on a diff in the pull request:

https://github.com/apache/flink/pull/1996#discussion_r65671113
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/math/distributed/DistributedRowMatrix.scala
 ---
@@ -0,0 +1,166 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.math.distributed
+
+import breeze.linalg.{CSCMatrix => BreezeSparseMatrix, Matrix => 
BreezeMatrix, Vector => BreezeVector}
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{Matrix => FlinkMatrix, _}
+
+/**
+  * Distributed row-major matrix representation.
+  * @param numRows Number of rows.
+  * @param numCols Number of columns.
+  */
+class DistributedRowMatrix(val data: DataSet[IndexedRow],
+   val numRows: Int,
+   val numCols: Int )
+extends DistributedMatrix {
+
+
+
+  /**
+* Collects the data in the form of a sequence of coordinates 
associated with their values.
+* @return
+*/
+  def toCOO: Seq[(Int, Int, Double)] = {
+
+val localRows = data.collect()
+
+for (IndexedRow(rowIndex, vector) <- localRows;
+ (columnIndex, value) <- vector) yield (rowIndex, columnIndex, 
value)
+  }
+
+  /**
+* Collects the data in the form of a SparseMatrix
+* @return
+*/
+  def toLocalSparseMatrix: SparseMatrix = {
+val localMatrix =
+  SparseMatrix.fromCOO(this.numRows, this.numCols, this.toCOO)
+require(localMatrix.numRows == this.numRows)
+require(localMatrix.numCols == this.numCols)
+localMatrix
+  }
+
+  //TODO: convert to dense representation on the distributed matrix and 
collect it afterward
+  def toLocalDenseMatrix: DenseMatrix = 
this.toLocalSparseMatrix.toDenseMatrix
+
+  /**
+* Applies a higher-order function to each couple of rows.
+* @param fun
+* @param other
+* @return
+*/
+  def byRowOperation(fun: (Vector, Vector) => Vector,
+ other: DistributedRowMatrix): DistributedRowMatrix = {
+val otherData = other.data
+require(this.numCols == other.numCols)
+require(this.numRows == other.numRows)
+
+val result = this.data
+  .fullOuterJoin(otherData)
+  .where("rowIndex")
+  .equalTo("rowIndex")(
+  (left: IndexedRow, right: IndexedRow) => {
+val row1 = Option(left) match {
+  case Some(row: IndexedRow) => row
+  case None =>
+IndexedRow(
+right.rowIndex,
+SparseVector.fromCOO(right.values.size, List((0, 0.0))))
+}
+val row2 = Option(right) match {
+  case Some(row: IndexedRow) => row
+  case None =>
+IndexedRow(
+left.rowIndex,
+SparseVector.fromCOO(left.values.size, List((0, 0.0))))
+}
+IndexedRow(row1.rowIndex, fun(row1.values, row2.values))
+  }
+  )
+new DistributedRowMatrix(result, numRows, numCols)
+  }
+
+  /**
+* Add the matrix to another matrix.
+* @param other
+* @return
+*/
+  def sum(other: DistributedRowMatrix): DistributedRowMatrix = {
+val sumFunction: (Vector, Vector) => Vector = (x: Vector, y: Vector) =>
+  (x.asBreeze + y.asBreeze).fromBreeze
+this.byRowOperation(sumFunction, other)
+  }
+
+  /**
+* Subtracts another matrix.
+* @param other
+* @return
+*/
+  def subtract(other: DistributedRowMatrix): DistributedRowMatrix = {
+val subFunction: (Vector, Vector) => Vector = (x: Vector, y: Vector) =>
+  (x.asBreeze - y.asBreeze).fromBreeze
+this.byRowOperation(subFunction, other)
+  

[GitHub] flink pull request #1517: [FLINK-3477] [runtime] Add hash-based combine stra...

2016-06-03 Thread ggevay
Github user ggevay commented on a diff in the pull request:

https://github.com/apache/flink/pull/1517#discussion_r65667687
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/operators/ReduceCombineDriver.java
 ---
@@ -42,34 +44,38 @@
  * Combine operator for Reduce functions, standalone (not chained).
  * Sorts and groups and reduces data, but never spills the sort. May 
produce multiple
  * partially aggregated groups.
- * 
+ *
  * @param  The data type consumed and produced by the combiner.
  */
 public class ReduceCombineDriver implements Driver {
-   
+
private static final Logger LOG = 
LoggerFactory.getLogger(ReduceCombineDriver.class);
 
/** Fixed-length records with a length below this threshold will be 
sorted in place, if possible. */
private static final int THRESHOLD_FOR_IN_PLACE_SORTING = 32;
-   
-   
+
+
private TaskContext taskContext;
 
private TypeSerializer serializer;
 
private TypeComparator comparator;
-   
+
private ReduceFunction reducer;
-   
+
private Collector output;
-   
+
+   private DriverStrategy strategy;
+
private InMemorySorter sorter;
-   
+
private QuickSort sortAlgo = new QuickSort();
 
+   private ReduceHashTable table;
+
private List memory;
 
-   private boolean running;
+   private volatile boolean canceled;
--- End diff --

Sorry, I forgot the rename in the chained driver:
6abd3f3cf49568cc0fecd85d7e7d8a0d7f9ec39f
And I forgot to invert the meaning with the rename in ReduceCombineDriver:
984ba12f44a7ee9b16790c3e172b53969448e1c2
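For readers following along, the pattern being renamed is the usual volatile 
cancellation flag; a minimal standalone sketch (names are illustrative, not the 
driver's actual code):

{code}
class Worker {
  // Written by the canceling thread, read by the processing loop; volatile
  // makes the write visible across threads without further synchronization.
  @volatile private var canceled = false

  def cancel(): Unit = canceled = true

  def run(elems: Iterator[Int]): Unit =
    while (!canceled && elems.hasNext) process(elems.next())

  private def process(i: Int): Unit = println(i)
}
{code}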


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-3477) Add hash-based combine strategy for ReduceFunction

2016-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313791#comment-15313791
 ] 

ASF GitHub Bot commented on FLINK-3477:
---

Github user ggevay commented on a diff in the pull request:

https://github.com/apache/flink/pull/1517#discussion_r65667687
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/operators/ReduceCombineDriver.java
 ---
@@ -42,34 +44,38 @@
  * Combine operator for Reduce functions, standalone (not chained).
  * Sorts and groups and reduces data, but never spills the sort. May 
produce multiple
  * partially aggregated groups.
- * 
+ *
  * @param  The data type consumed and produced by the combiner.
  */
 public class ReduceCombineDriver implements Driver {
-   
+
private static final Logger LOG = 
LoggerFactory.getLogger(ReduceCombineDriver.class);
 
/** Fixed-length records with a length below this threshold will be 
sorted in place, if possible. */
private static final int THRESHOLD_FOR_IN_PLACE_SORTING = 32;
-   
-   
+
+
private TaskContext taskContext;
 
private TypeSerializer serializer;
 
private TypeComparator comparator;
-   
+
private ReduceFunction reducer;
-   
+
private Collector output;
-   
+
+   private DriverStrategy strategy;
+
private InMemorySorter sorter;
-   
+
private QuickSort sortAlgo = new QuickSort();
 
+   private ReduceHashTable table;
+
private List memory;
 
-   private boolean running;
+   private volatile boolean canceled;
--- End diff --

Sorry, I forgot the rename in the chained driver:
6abd3f3cf49568cc0fecd85d7e7d8a0d7f9ec39f
And I forgot to invert the meaning with the rename in ReduceCombineDriver:
984ba12f44a7ee9b16790c3e172b53969448e1c2


> Add hash-based combine strategy for ReduceFunction
> --
>
> Key: FLINK-3477
> URL: https://issues.apache.org/jira/browse/FLINK-3477
> Project: Flink
>  Issue Type: Sub-task
>  Components: Local Runtime
>Reporter: Fabian Hueske
>Assignee: Gabor Gevay
>
> This issue is about adding a hash-based combine strategy for ReduceFunctions.
> The interface of the {{reduce()}} method is as follows:
> {code}
> public T reduce(T v1, T v2)
> {code}
> Input type and output type are identical and the function returns only a 
> single value. A Reduce function is applied incrementally to compute a final 
> aggregated value. This makes it possible to hold the pre-aggregated value in a 
> hash table and update it with each function call. 
> The hash-based strategy requires a special implementation of an in-memory hash 
> table. The hash table should support in-place updates of elements (if the new 
> value has the same binary length as the old one) but also appending updates 
> with invalidation of the old value (if the binary length of the new value 
> differs). The hash table needs to be able to evict and emit all elements if 
> it runs out of memory.
> We should also add {{HASH}} and {{SORT}} compiler hints to 
> {{DataSet.reduce()}} and {{Grouping.reduce()}} to allow users to pick the 
> execution strategy.
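A language-agnostic sketch of the hash-based combine idea described above 
(shown in Scala with a plain HashMap; Flink's actual ReduceHashTable manages 
its own memory segments and eviction):

{code}
import scala.collection.mutable

// Keep one pre-aggregated value per key and fold each incoming element into
// it, instead of sorting the input first.
def hashCombine[K, T](elems: Iterator[T], key: T => K,
    reduce: (T, T) => T): Iterator[T] = {
  val table = mutable.HashMap.empty[K, T]
  for (e <- elems) {
    val k = key(e)
    table(k) = table.get(k) match {
      case Some(acc) => reduce(acc, e) // in-place update of the aggregate
      case None      => e
    }
  }
  table.valuesIterator
}
{code}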



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)