[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation

2015-07-16 Thread fhueske
Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/905#issuecomment-122117624
  
Hi @pp86, are you still working on this PR? 
I would suggest opening a new PR for the documentation (FLINK-2362).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-1963) Improve distinct() transformation

2015-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630487#comment-14630487
 ] 

ASF GitHub Bot commented on FLINK-1963:
---

Github user hsaputra commented on a diff in the pull request:

https://github.com/apache/flink/pull/905#discussion_r34847588
  
--- Diff: 
flink-tests/src/test/java/org/apache/flink/test/javaApiOperators/DistinctITCase.java
 ---
@@ -274,4 +277,45 @@ public Integer map(POJO value) throws Exception {
return (int) value.nestedPojo.longNumber;
}
}
+
+   @Test
+   public void testCorrectnessOfDistinctOnAtomic() throws Exception {
+   /*
+   * check correctness of distinct on Integers
+   */
+
+   final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
+   DataSet<Integer> ds = CollectionDataSets.getIntegerDataSet(env);
+   DataSet<Integer> reduceDs = ds.distinct();
+
+   List<Integer> result = reduceDs.collect();
+
+   String expected = "1\n2\n3\n4\n5";
+
+   compareResultAsText(result, expected);
+   }
+
+   @Test
+   public void testCorrectnessOfDistinctOnAtomicWithSelectAllChar() throws Exception {
+   /*
+   /*
--- End diff --

Looks like misaligned for the comment block.


 Improve distinct() transformation
 -

 Key: FLINK-1963
 URL: https://issues.apache.org/jira/browse/FLINK-1963
 Project: Flink
  Issue Type: Improvement
  Components: Java API, Scala API
Affects Versions: 0.9
Reporter: Fabian Hueske
Assignee: pietro pinoli
Priority: Minor
  Labels: starter
 Fix For: 0.9


 The `distinct()` transformation is a bit limited right now with respect to 
 processing atomic key types:
 - `distinct(String ...)` works only for composite data types (POJO, tuple), 
 but wildcard expression should also be supported for atomic key types
 - `distinct()` only works for composite types, but should also work for 
 atomic key types
 - `distinct(KeySelector)` is the most generic one, but not very handy to use
 - `distinct(int ...)` works only for Tuple data types (which is fine)
 Fixing this should be rather easy.
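The gap described above is easy to picture outside of Flink. Below is a minimal sketch (plain Python with illustrative names, not the Flink API) of the two behaviours the ticket asks for: deduplicating atomic values directly, and deduplicating via a key selector.

```python
# Sketch of the distinct() semantics discussed above (not Flink code).

def distinct_atomic(data):
    """distinct() on an atomic type: the value itself is the key."""
    seen, out = set(), []
    for value in data:
        if value not in seen:
            seen.add(value)
            out.append(value)
    return out

def distinct_by_key(data, key_selector):
    """distinct(KeySelector): deduplicate by an extracted key,
    keeping the first element seen for each key."""
    seen, out = set(), []
    for value in data:
        key = key_selector(value)
        if key not in seen:
            seen.add(key)
            out.append(value)
    return out
```

With atomic key support, `distinct()` on a `DataSet<Integer>` would behave like `distinct_atomic` above, while `distinct(KeySelector)` already behaves like `distinct_by_key`.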



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1963) Improve distinct() transformation

2015-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630488#comment-14630488
 ] 

ASF GitHub Bot commented on FLINK-1963:
---

Github user hsaputra commented on a diff in the pull request:

https://github.com/apache/flink/pull/905#discussion_r34847597
  
--- Diff: 
flink-tests/src/test/scala/org/apache/flink/api/scala/operators/DistinctITCase.scala
 ---
@@ -174,6 +174,45 @@ class DistinctITCase(mode: TestExecutionMode) extends 
MultipleProgramsTestBase(m
 env.execute()
expected = "1\n2\n3\n"
   }
+
+  @Test
+  def testCorrectnessOfDistinctOnAtomic(): Unit = {
+/*
--- End diff --

Looks like misaligned for the comment block. 




[jira] [Closed] (FLINK-1963) Improve distinct() transformation

2015-07-16 Thread Fabian Hueske (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Hueske closed FLINK-1963.

   Resolution: Fixed
Fix Version/s: (was: 0.9)
   0.10

fixed with 08ca9ffa9a95610c073145a09e731311e728c4fd

 Improve distinct() transformation
 -

 Key: FLINK-1963
 URL: https://issues.apache.org/jira/browse/FLINK-1963
 Project: Flink
  Issue Type: Improvement
  Components: Java API, Scala API
Affects Versions: 0.9
Reporter: Fabian Hueske
Assignee: pietro pinoli
Priority: Minor
  Labels: starter
 Fix For: 0.10


 The `distinct()` transformation is a bit limited right now with respect to 
 processing atomic key types:
 - `distinct(String ...)` works only for composite data types (POJO, tuple), 
 but wildcard expression should also be supported for atomic key types
 - `distinct()` only works for composite types, but should also work for 
 atomic key types
 - `distinct(KeySelector)` is the most generic one, but not very handy to use
 - `distinct(int ...)` works only for Tuple data types (which is fine)
 Fixing this should be rather easy.





[jira] [Updated] (FLINK-1748) Integrate PageRank implementation into machine learning library

2015-07-16 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-1748:
-
Priority: Minor  (was: Major)

 Integrate PageRank implementation into machine learning library
 ---

 Key: FLINK-1748
 URL: https://issues.apache.org/jira/browse/FLINK-1748
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Till Rohrmann
Priority: Minor
  Labels: ML, Starter

 We already have an excellent approximative PageRank [1] implementation which 
 has been contributed by [~StephanEwen]. Making this implementation part of 
 the machine learning library would be a great addition.
 Resources:
 [1] [http://en.wikipedia.org/wiki/PageRank]





[jira] [Commented] (FLINK-2332) Assign session IDs to JobManager and TaskManager messages

2015-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629436#comment-14629436
 ] 

ASF GitHub Bot commented on FLINK-2332:
---

GitHub user tillrohrmann opened a pull request:

https://github.com/apache/flink/pull/917

[FLINK-2332] [runtime] Adds leader session IDs and registration session IDs

## Registration Session IDs

Introduces registration session IDs for all registration messages. Upon 
receiving a registration message, this ID is checked and, if it does not match, 
the message is discarded. That way, it is possible to distinguish delayed old 
registration messages from valid ones. 

In the current implementation, a static registration session ID is assigned 
when the `TaskManager` actor is created. However, with support for high 
availability, where the leader can change while a `TaskManager` is still trying 
to register at the old leader, it becomes important to distinguish old from new 
registration messages.

## Leader Session IDs

In order to support high availability, we not only have to distinguish the 
old from the new registration messages, but also control messages which are 
sent to and from the `JobManager` and the `TaskManager`. In order to filter out 
possibly old messages, this PR introduces a leader session ID which denotes the 
currently valid messages. However, unlike the registration session ID, leader 
session IDs are assigned to messages transparently. 

Messages which extend the `RequiresLeaderSessionID` interface will be 
wrapped in a `LeaderSessionMessage`, which also contains the currently known 
leader session ID. At the receiving end, the `LeaderSessionMessages` are 
unpacked and the received leader session ID is compared to the currently stored 
leader session ID. If both IDs are the same, the wrapped message is processed. 
If not, the wrapped message is discarded.

In order to support this behaviour, the PR introduces a new `FlinkActor` 
for Scala actors and a `FlinkUntypedActor` for Java actors. Both actors provide 
a `decorateMessage` method which allows sub types of the 
`FlinkActor`/`FlinkUntypedActor` to decorate outgoing messages. Therefore, all 
implementing classes are supposed to call `decorateMessage` before sending a 
message to another actor.

The `FlinkUntypedActor` already comes with support for message logging and 
leader session message filtering. Furthermore, its `decorateMessage` method 
implementation checks whether each message is a subtype of 
`RequiresLeaderSessionID` and, if so, wraps it in a `LeaderSessionMessage`. The 
receive method of this class will then take care of unwrapping the messages 
accordingly.

In order to have the same behaviour with Scala actors, one has to extend the 
`FlinkActor` and mix in the `LeaderSessionMessages` and `LoggingMessages` 
traits. They effectively do the same as the `FlinkUntypedActor`, but offer 
better extensibility of the Scala actors in the future.

In case a `RequiresLeaderSessionID` message is received without being 
wrapped in a `LeaderSessionMessage` by a FlinkActor, an exception is thrown, 
which effectively terminates the execution of the actor. The reason for this is 
that an unwrapped message might leave the system in an inconsistent state if 
it's a message from an old leader. Furthermore, since not every 
`RequiresLeaderSessionID` message requires a response, it is not possible to 
notify the sender of the wrong message about the missing leader session ID.
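The decorate/unwrap/discard cycle described above can be sketched roughly as follows (plain Python, not Flink's actual actor classes; the names mirror the description but the signatures are assumptions):

```python
import uuid

class RequiresLeaderSessionID:
    """Marker type: messages of this kind must be wrapped before sending."""
    pass

class LeaderSessionMessage:
    """Envelope carrying the sender's currently known leader session ID."""
    def __init__(self, leader_session_id, message):
        self.leader_session_id = leader_session_id
        self.message = message

class Actor:
    """Stand-in for a FlinkActor/FlinkUntypedActor with session filtering."""
    def __init__(self, leader_session_id):
        self.leader_session_id = leader_session_id
        self.processed = []

    def decorate_message(self, message):
        # Wrap only messages that require a leader session ID.
        if isinstance(message, RequiresLeaderSessionID):
            return LeaderSessionMessage(self.leader_session_id, message)
        return message

    def receive(self, message):
        if isinstance(message, LeaderSessionMessage):
            # Process only if the IDs match; silently drop stale messages.
            if message.leader_session_id == self.leader_session_id:
                self.processed.append(message.message)
        elif isinstance(message, RequiresLeaderSessionID):
            # An unwrapped session-bound message is a fatal protocol error.
            raise RuntimeError("unwrapped RequiresLeaderSessionID message")
        else:
            self.processed.append(message)
```

A stale `LeaderSessionMessage` (old leader ID) is dropped, while an unwrapped session-bound message terminates the actor, matching the behaviour described above.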

## ActorGateway refactoring

In order to guarantee the same wrapping behaviour when messages are sent 
from outside of an actor, the former `InstanceGateway` has been refactored 
to `ActorGateway`, and all `ActorRef` interactions have been replaced by 
`ActorGateway` instances. Only the web server still uses `ActorRefs`, because 
it is about to be refactored anyway (see #677). However, the PR #677 should be 
updated accordingly. 

`AkkaActorGateway` implements the `ActorGateway` and makes sure that all 
`RequiresLeaderSessionID` messages are wrapped correctly in a 
`LeaderSessionMessage`. In order to make this happen, the `AkkaActorGateway` is 
given the current leader session ID upon creation. For any actor interaction 
from outside of an actor, this class should be used. Using this abstraction 
will allow us to easily extend the decoration behaviour of messages in the 
future, too. 

## Style formatting

The PR also contains some Scala style harmonisation of old Scala code.

## TL;DR

In order to support leader session IDs all Flink actors should extend 
`FlinkUntypedActor` or `FlinkActor` with `LeaderSessionMessages` and 
`LogMessages` mixins. Whenever a message is sent from within an actor, the 
message should be decorated by calling `decorateMessage`. When messages 

[jira] [Commented] (FLINK-2366) HA Without ZooKeeper

2015-07-16 Thread Suminda Dharmasena (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629503#comment-14629503
 ] 

Suminda Dharmasena commented on FLINK-2366:
---

You are using it as a library.

 HA Without ZooKeeper
 

 Key: FLINK-2366
 URL: https://issues.apache.org/jira/browse/FLINK-2366
 Project: Flink
  Issue Type: Improvement
Reporter: Suminda Dharmasena
Priority: Minor

 Please provide a way to do HA without having to use ZK.





[jira] [Commented] (FLINK-2366) HA Without ZooKeeper

2015-07-16 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629505#comment-14629505
 ] 

Till Rohrmann commented on FLINK-2366:
--

But why can you not do the same with ZK? If you start a DB process next to your 
Flink cluster, then you can also start a ZK process, right?



[jira] [Commented] (FLINK-2366) HA Without ZooKeeper

2015-07-16 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629515#comment-14629515
 ] 

Fabian Hueske commented on FLINK-2366:
--

Besides being a distributed data processing engine, Flink offers an embedded 
mode which executes Flink programs on Java collections. 
Is this what you refer to? If yes, this mode does not require HA. It runs in 
the same JVM process as your application and is not meant for large data sets. 
If that process goes down, you don't need HA anymore. I don't see any option 
other than this mode for using Flink as a library.

Except for the embedded collection execution mode, Flink runs as a distributed 
system for which HA is a very useful feature. As [~senorcarbone] pointed out, 
HA in a distributed system means leader election, Paxos, etc. ZK provides all 
of this and reimplementing that in Flink is not an option, IMO.



[jira] [Commented] (FLINK-2365) Review of How to contribute page

2015-07-16 Thread Enrique Bautista Barahona (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629540#comment-14629540
 ] 

Enrique Bautista Barahona commented on FLINK-2365:
--

Here's the PR: https://github.com/apache/flink-web/pull/3

 Review of How to contribute page
 

 Key: FLINK-2365
 URL: https://issues.apache.org/jira/browse/FLINK-2365
 Project: Flink
  Issue Type: Bug
  Components: Project Website
Reporter: Enrique Bautista Barahona
Priority: Minor

 While reading the [How to contribute 
 page|https://flink.apache.org/how-to-contribute.html] on the website I have 
 noticed some typos, broken links, inconsistent formatting, etc.
 I plan to submit a PR with some improvements soon.





[jira] [Resolved] (FLINK-2364) Link to Guide is Broken

2015-07-16 Thread Maximilian Michels (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maximilian Michels resolved FLINK-2364.
---
Resolution: Not A Problem

Thanks for reporting. Due to the new website's structure, the guide is now 
located at 
https://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html.
 Currently, we cannot use redirects on our documentation infrastructure.

 Link to Guide is Broken
 ---

 Key: FLINK-2364
 URL: https://issues.apache.org/jira/browse/FLINK-2364
 Project: Flink
  Issue Type: Bug
Reporter: Suminda Dharmasena

 https://ci.apache.org/projects/flink/flink-docs-master/libs/programming_guide.html





[jira] [Commented] (FLINK-2366) HA Without ZooKeeper

2015-07-16 Thread Suminda Dharmasena (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629494#comment-14629494
 ] 

Suminda Dharmasena commented on FLINK-2366:
---

You can have an embedded DB to which state is synchronised. ZK can be a 
secondary option.

If you are using Flink purely embedded, you cannot deploy other software unless 
your program handles it. This is inconvenient, as you cannot take into account 
all the options.



[jira] [Commented] (FLINK-2366) HA Without ZooKeeper

2015-07-16 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629500#comment-14629500
 ] 

Fabian Hueske commented on FLINK-2366:
--

What exactly do you mean by using Flink purely embedded?
High availability is a feature of distributed systems and I am not sure how 
that would apply to an embedded system.



[jira] [Commented] (FLINK-2366) HA Without ZooKeeper

2015-07-16 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629507#comment-14629507
 ] 

Till Rohrmann commented on FLINK-2366:
--

But in the end, you will execute your jobs on a Flink cluster, right? Or what 
do you do with Flink as a library?



[jira] [Commented] (FLINK-2366) HA Without ZooKeeper

2015-07-16 Thread Suminda Dharmasena (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629514#comment-14629514
 ] 

Suminda Dharmasena commented on FLINK-2366:
---

I do not want to start another process. An embedded DB (Derby, H2, etc.) does not 
run as a separate process; it is embedded within the program itself.



[jira] [Commented] (FLINK-2267) Support multi-class scoring for binary classification scores

2015-07-16 Thread Rohit Shinde (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629549#comment-14629549
 ] 

Rohit Shinde commented on FLINK-2267:
-

Can this issue be assigned to me? What do I need to know before I start 
with this? This will be my first issue.

 Support multi-class scoring for binary classification scores
 

 Key: FLINK-2267
 URL: https://issues.apache.org/jira/browse/FLINK-2267
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Priority: Minor

 Some scores like accuracy, recall and F-score are designed for binary 
 classification.
 They can be used to evaluate multi-class problems as well by using micro or 
 macro averaging techniques.
 This ticket is about creating such an option,  allowing our binary 
 classification metrics to be used in multi-class problems.
 You can check out [sklearn's user 
 guide|http://scikit-learn.org/stable/modules/model_evaluation.html#multiclass-and-multilabel-classification]
  for more info.
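The micro/macro distinction mentioned above can be sketched for recall (plain Python; the per-class counts are made up for illustration and the function names are not from any library):

```python
# Hedged sketch: micro vs. macro averaged recall for a multi-class problem,
# computed from per-class true-positive (tp) / false-negative (fn) counts.

def recall_micro(stats):
    """Micro average: pool all classes' counts, then compute one recall."""
    tp = sum(s["tp"] for s in stats.values())
    fn = sum(s["fn"] for s in stats.values())
    return tp / (tp + fn)

def recall_macro(stats):
    """Macro average: compute recall per class, then average unweighted."""
    per_class = [s["tp"] / (s["tp"] + s["fn"]) for s in stats.values()]
    return sum(per_class) / len(per_class)

# Illustrative counts: one common class with high recall, one rare class
# with low recall. Micro averaging is dominated by the common class;
# macro averaging weights both classes equally.
stats = {
    "a": {"tp": 90, "fn": 10},
    "b": {"tp": 1,  "fn": 9},
}
```

Here `recall_micro(stats)` is 91/110 ≈ 0.83 while `recall_macro(stats)` is 0.5, which is why the choice of averaging matters for imbalanced multi-class problems.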





[jira] [Commented] (FLINK-2366) HA Without ZooKeeper

2015-07-16 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629565#comment-14629565
 ] 

Till Rohrmann commented on FLINK-2366:
--

What exactly do you mean by embedded and distributed?

If you use Flink's embedded mode, then every node works independently of the 
others. There is no way to make the embedded Flink instances work together.

If you want to run Flink in a distributed manner, then you have to start a 
Flink cluster. And then you can also start a ZK on the same nodes, if there is 
none available. But usually, you want to run this kind of thing on highly 
reliable nodes and not in YARN containers, for example.

At first glance, copycat seems to be usable for HA as well. It offers 
functionality similar to what we're currently using from ZK. It should not be a 
problem to implement a new {{LeaderElectionService}}/{{LeaderRetrievalService}} 
which uses copycat. However, copycat is still under heavy development and not 
recommended for production use. And you have the problem that your 
fault-tolerant value store runs on the same nodes as Flink; these nodes 
are not necessarily reliable.
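For illustration only, here is a trivial in-process stand-in for the {{LeaderElectionService}}/{{LeaderRetrievalService}} pair mentioned above (plain Python; the method names are assumptions, not Flink's actual interfaces, and a real backend such as ZK or a Raft library would replace the in-memory state):

```python
import uuid

class InMemoryLeaderService:
    """Toy election + retrieval service: grants leadership to the first
    contender and notifies retrieval listeners. No real consensus."""
    def __init__(self):
        self.leader = None       # (contender, session_id) once elected
        self.listeners = []      # retrieval-side callbacks

    def start_election(self, contender):
        """Grant leadership to the first contender that asks."""
        if self.leader is None:
            session_id = uuid.uuid4()
            self.leader = (contender, session_id)
            for listener in self.listeners:
                listener(contender, session_id)
            return True
        return False

    def retrieve_leader(self, listener):
        """Register a listener; notify immediately if a leader exists."""
        self.listeners.append(listener)
        if self.leader is not None:
            listener(*self.leader)
```

The point of the abstraction is that only this service would need a new implementation to swap ZK for something else; the rest of the system just reacts to leader notifications.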



[jira] [Commented] (FLINK-2364) Link to Guide is Broken

2015-07-16 Thread Suminda Dharmasena (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629508#comment-14629508
 ] 

Suminda Dharmasena commented on FLINK-2364:
---

I cannot remember the referring document, but there are many Flink documents 
which refer to invalid links. Perhaps they have moved.



[jira] [Commented] (FLINK-2366) HA Without ZooKeeper

2015-07-16 Thread Suminda Dharmasena (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629532#comment-14629532
 ] 

Suminda Dharmasena commented on FLINK-2366:
---

What I am pushing for is embedded, HA, and distributed. Any of the nodes should 
be able to discover the other nodes and form a cluster. The program is 
distributed, but each instance embeds Flink as a library. There are DBs which 
can be used as libraries, like Derby and H2. These could be used. 

Also, there are libraries which are not heavyweight and can be embedded as a 
library, like: https://github.com/kuujo/copycat.

Alternatively, you can see how you can embed ZK and use it while the whole of 
Flink is used as a library in a distributed, HA context.



[jira] [Comment Edited] (FLINK-2300) Links on FAQ page not rendered correctly

2015-07-16 Thread Enrique Bautista Barahona (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14628555#comment-14628555
 ] 

Enrique Bautista Barahona edited comment on FLINK-2300 at 7/16/15 10:38 AM:


[~mxm], I've thought about this and maybe it doesn't make much sense to update 
the links.

The plugin lets us hide the base url but, after all:
- there aren't that many links of that type, and they aren't that different 
from a regular link
- you have to write the rest of the path to the file or folder, which can and 
usually will be quite long and messy (e.g. 
{{flink-java/src/main/java/org/apache/flink/api/java/ExecutionEnvironment.java#L831}}).
- there are a lot of other explicit markdown links to the GitHub Apache repo, 
the Stratosphere repo, the issue tracker, mailing lists, etc.; some of them 
with hardcoded urls, some of them reading constants from the config file.

I think it would be simpler to not use the plugin at all (remove it, even) 
and instead write regular markdown links. If you still want to abstract or 
reuse the urls, it may be better to just use the config file. But for the sake 
of consistency, maybe it should be done with every link, and I'm not sure it's 
worth it.

What do you think?



 Links on FAQ page not rendered correctly
 

 Key: FLINK-2300
 URL: https://issues.apache.org/jira/browse/FLINK-2300
 Project: Flink
  Issue Type: Bug
  Components: Project Website
Reporter: Robert Metzger
Assignee: Maximilian Michels
Priority: Minor
  Labels: starter
 Attachments: fix_github_plugin.patch


 On the Flink website, the links using the github plugin are broken.
 For example
 {code}
 {% github README.md master build instructions %}
 {code}
 renders to
 {code}
 https://github.com/apache/flink/tree/master/README.md
 {code}
 See:
 http://flink.apache.org/faq.html#my-job-fails-early-with-a-javaioeofexception-what-could-be-the-cause
 I was not able to resolve the issue by using {{a}} tags or markdown links.





[GitHub] flink pull request: [FLINK-2131][ml]: Initialization schemes for k...

2015-07-16 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-121924666
  
Okay. @tillrohrmann, can you review this?




[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2015-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629545#comment-14629545
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-121924666
  
Okay. @tillrohrmann, can you review this?


 Add Initialization schemes for K-means clustering
 -

 Key: FLINK-2131
 URL: https://issues.apache.org/jira/browse/FLINK-2131
 Project: Flink
  Issue Type: Task
  Components: Machine Learning Library
Reporter: Sachin Goel
Assignee: Sachin Goel

 The Lloyd's [KMeans] algorithm takes initial centroids as its input. However, 
 in case the user doesn't provide the initial centers, they may ask for a 
 particular initialization scheme to be followed. The most commonly used are 
 these:
 1. Random initialization: Self-explanatory
 2. kmeans++ initialization: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
 3. kmeans|| : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
 For very large data sets, or for large values of k, the kmeans|| method is 
 preferred as it provides the same approximation guarantees as kmeans++ and 
 requires fewer passes over the input data.
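The kmeans++ seeding rule (scheme 2) can be sketched for 1-D points as follows (plain Python, illustrative only, not Flink ML code): each next center is drawn with probability proportional to the squared distance to the nearest center already chosen.

```python
import random

def kmeans_pp_init(points, k, rng=None):
    """k-means++ seeding on 1-D points: pick the first center uniformly,
    then each next center with probability proportional to D(x)^2, the
    squared distance to the nearest already-chosen center."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        # Sample a point weighted by d2 (roulette-wheel selection).
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers
```

kmeans|| follows the same weighting idea but oversamples several candidates per pass, which is what allows it to use far fewer passes over large data sets.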





[GitHub] flink pull request: [FLINK-2131][ml]: Initialization schemes for k...

2015-07-16 Thread tillrohrmann
Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-121928501
  
Will do, once I have some free time. Currently I have to finish the high
availability support. I'll try to hurry up.

On Thu, Jul 16, 2015 at 12:50 PM, Sachin Goel notificati...@github.com
wrote:

 Okay. @tillrohrmann https://github.com/tillrohrmann, can you review
 this?

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/flink/pull/757#issuecomment-121924666.






[jira] [Closed] (FLINK-1748) Integrate PageRank implementation into machine learning library

2015-07-16 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-1748.

Resolution: Done

As part of Gelly.



[GitHub] flink pull request: [FLINK-2332] [runtime] Adds leader session IDs...

2015-07-16 Thread tillrohrmann
GitHub user tillrohrmann opened a pull request:

https://github.com/apache/flink/pull/917

[FLINK-2332] [runtime] Adds leader session IDs and registration session IDs

## Registration Session IDs

Introduces registration session IDs for all registration messages. Upon 
receiving a registration message, this ID is checked and if not correct, the 
message is discarded. That way, it is possible to distinguish old registration 
messages which are delayed from valid ones. 

In the current implementation, a static registration session ID is assigned 
when the `TaskManager` actor is created. However, with support for high 
availability, where the leader can change while a task manager is still trying 
to register at the old leader, it becomes important to distinguish old from new 
registration messages.

## Leader Session IDs

In order to support high availability, we not only have to distinguish the 
old from the new registration messages, but also control messages which are 
sent to and from the `JobManager` and the `TaskManager`. In order to filter out 
possibly old messages, this PR introduces a leader session ID which denotes the 
currently valid messages. However, unlike the registration session ID, leader 
session IDs are assigned to messages transparently. 

Messages which extend the `RequiresLeaderSessionID` interface will be 
wrapped in a `LeaderSessionMessage` which also contains the currently known 
leader session ID. At the receiving end, the `LeaderSessionMessages` are 
unpacked and the received leader session ID is compared to the currently stored 
leader session ID. If both IDs are the same, the wrapped message is processed. 
If not, the wrapped message is discarded.

In order to support this behaviour, the PR introduces a new `FlinkActor` 
for Scala actors and a `FlinkUntypedActor` for Java actors. Both actors provide 
a `decorateMessage` method which allows sub types of the 
`FlinkActor`/`FlinkUntypedActor` to decorate outgoing messages. Therefore, all 
implementing classes are supposed to call `decorateMessage` before sending a 
message to another actor.

The `FlinkUntypedActor` already comes with support for message logging and 
leader session message filtering. Furthermore, its `decorateMessage` method 
implementation checks for each message whether it is a subtype of 
`RequiresLeaderSessionID` and, if so, wraps the message 
in a `LeaderSessionMessage`. The receive method of this class then takes 
care of unwrapping the messages accordingly.

In order to have the same behavior with Scala actors one has to extend the 
`FlinkActor` and mixin the `LeaderSessionMessages` and `LoggingMessages` 
mixins. They effectively do the same as the `FlinkUntypedActor`, but offer a 
better extensibility of the Scala actors in the future.

In case a `RequiresLeaderSessionID` message is received without being 
wrapped in a `LeaderSessionMessage` by a FlinkActor, an exception is thrown, 
which effectively terminates the execution of the actor. The reason for this is 
that an unwrapped message might leave the system in an inconsistent state if 
it is a message from an old leader. Furthermore, since not every 
`RequiresLeaderSessionID` message requires a response, it is not possible to 
notify the sender of the faulty message about the missing leader session ID.

## ActorGateway refactoring

In order to guarantee the same wrapping behaviour when one sends 
messages from outside of an actor, the former `InstanceGateway` has been refactored 
to `ActorGateway` and all `ActorRef` interactions have been replaced by 
`ActorGateway` instances. Only the web server still uses `ActorRef`s, because 
it is about to be refactored anyway (see #677). However, the PR #677 should be 
updated accordingly. 

`AkkaActorGateway` implements the `ActorGateway` and makes sure that all 
`RequiresLeaderSessionID` messages are wrapped correctly in a 
`LeaderSessionMessage`. In order to make this happen, the `AkkaActorGateway` is 
given the current leader session ID upon creation. For any actor interaction 
from outside of an actor, this class should be used. Using this abstraction 
will allow us to easily extend the decoration behaviour of messages in the 
future, too. 

## Style formatting

The PR also contains some Scala style harmonisation of old Scala code.

## TL;DR

In order to support leader session IDs all Flink actors should extend 
`FlinkUntypedActor` or `FlinkActor` with `LeaderSessionMessages` and 
`LogMessages` mixins. Whenever a message is sent from within an actor, the 
message should be decorated by calling `decorateMessage`. When messages are 
sent from outside of an actor, the `AkkaActorGateway` should be used instead of 
directly using `ActorRef`. That way, we guarantee the proper wrapping of the 
messages.
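The wrapping and filtering described above can be condensed into a plain-Java sketch. Note that this is a simplified stand-in, not actual Flink code: the class names (`LeaderSessionMessage`, `RequiresLeaderSessionID`, `decorateMessage`) only mirror the PR's terminology, and the actor machinery is reduced to a simple receive loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Simplified model of the leader-session filtering described in the PR.
public class LeaderSessionDemo {

    // Marker for messages that must carry a valid leader session ID.
    public interface RequiresLeaderSessionID {}

    // Envelope pairing a payload with the sender's known leader session ID.
    public static class LeaderSessionMessage {
        final UUID leaderSessionID;
        final Object message;
        LeaderSessionMessage(UUID id, Object msg) { leaderSessionID = id; message = msg; }
    }

    public static class CancelTask implements RequiresLeaderSessionID {}

    // Receiving side: unwrap messages and drop those from stale leaders.
    public static class Receiver {
        public final UUID currentLeaderSessionID;
        public final List<Object> processed = new ArrayList<>();

        public Receiver(UUID current) { currentLeaderSessionID = current; }

        public void receive(Object raw) {
            if (raw instanceof LeaderSessionMessage) {
                LeaderSessionMessage wrapped = (LeaderSessionMessage) raw;
                if (currentLeaderSessionID.equals(wrapped.leaderSessionID)) {
                    processed.add(wrapped.message);  // valid leader: handle payload
                }                                    // else: silently discard
            } else if (raw instanceof RequiresLeaderSessionID) {
                // Unwrapped session-critical message: fail fast, as the PR describes.
                throw new IllegalStateException("message requires a leader session ID");
            } else {
                processed.add(raw);                  // plain message, no session check
            }
        }
    }

    // Sender side: decorateMessage wraps session-critical messages transparently.
    public static Object decorateMessage(UUID leaderId, Object msg) {
        return (msg instanceof RequiresLeaderSessionID)
                ? new LeaderSessionMessage(leaderId, msg) : msg;
    }

    public static void main(String[] args) {
        UUID leader = UUID.randomUUID();
        UUID oldLeader = UUID.randomUUID();
        Receiver r = new Receiver(leader);

        r.receive(decorateMessage(leader, new CancelTask()));    // accepted
        r.receive(decorateMessage(oldLeader, new CancelTask())); // discarded
        r.receive("heartbeat");                                  // not session-critical

        System.out.println(r.processed.size()); // 2
    }
}
```

The key property, as in the PR, is that senders never wrap messages by hand: `decorateMessage` decides based on the marker interface, so filtering stays transparent to the business logic.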

You can merge this pull request into a Git repository by running:

$ git 

[jira] [Updated] (FLINK-2366) HA Without ZooKeeper

2015-07-16 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-2366:
-
Priority: Minor  (was: Blocker)

 HA Without ZooKeeper
 

 Key: FLINK-2366
 URL: https://issues.apache.org/jira/browse/FLINK-2366
 Project: Flink
  Issue Type: Bug
Reporter: Suminda Dharmasena
Priority: Minor

 Please provide a way to do HA without having to use ZK.





[jira] [Commented] (FLINK-2366) HA Without ZooKeeper

2015-07-16 Thread Paris Carbone (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629373#comment-14629373
 ] 

Paris Carbone commented on FLINK-2366:
--

That's a tough one! ^^
If you mean using something else, we could use a transactional persistent 
database, which would have the same effect as ZooKeeper (perhaps with better throughput).
If, on the other hand, you mean creating something totally custom, that can be 
pretty tricky, as Till mentioned. We need leader election, two-phase commit, Paxos, 
etc. These things are already solved by ZooKeeper.


 HA Without ZooKeeper
 

 Key: FLINK-2366
 URL: https://issues.apache.org/jira/browse/FLINK-2366
 Project: Flink
  Issue Type: Improvement
Reporter: Suminda Dharmasena
Priority: Minor

 Please provide a way to do HA without having to use ZK.





[jira] [Commented] (FLINK-2366) HA Without ZooKeeper

2015-07-16 Thread Ufuk Celebi (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629596#comment-14629596
 ] 

Ufuk Celebi commented on FLINK-2366:


I agree with Till. The use case with embedded mode (collection execution) 
seems off.

But I think there is a nice related use case where we just want to be able to 
restart a single job manager and continue where we left off. Then we only need 
the persistent storage part and no distributed consensus.

 HA Without ZooKeeper
 

 Key: FLINK-2366
 URL: https://issues.apache.org/jira/browse/FLINK-2366
 Project: Flink
  Issue Type: Improvement
Reporter: Suminda Dharmasena
Priority: Minor

 Please provide a way to do HA without having to use ZK.





[jira] [Created] (FLINK-2368) Add a convergence criteria

2015-07-16 Thread Sachin Goel (JIRA)
Sachin Goel created FLINK-2368:
--

 Summary: Add a convergence criteria
 Key: FLINK-2368
 URL: https://issues.apache.org/jira/browse/FLINK-2368
 Project: Flink
  Issue Type: Bug
  Components: Machine Learning Library
Reporter: Sachin Goel
Assignee: Sachin Goel


Add a convergence criterion for the Machine Learning library. If an algorithm is 
iteration based, it should have a setConvergenceCriteria function which 
checks for convergence at the end of every iteration based on a user-defined 
criterion. 
The algorithm writer will provide the user with the training data set, the solution 
before an iteration, and the solution after the iteration; the user returns 
an empty data set in case the training needs to be terminated.
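A minimal stand-alone sketch of such a per-iteration convergence check is shown below. The shape is hypothetical: the actual setConvergenceCriteria API operates on DataSets and is still being designed in the linked PR, so the "solution" here is just a double and the criterion a plain predicate.

```java
import java.util.function.BiPredicate;

// Hypothetical sketch of an iteration loop with a pluggable convergence check.
public class ConvergenceDemo {

    // Runs up to maxIterations steps, stopping early once the user-supplied
    // criterion declares the pair (solution before, solution after) converged.
    public static double iterate(double start, int maxIterations,
                                 BiPredicate<Double, Double> hasConverged) {
        double current = start;
        for (int i = 0; i < maxIterations; i++) {
            double next = (current + 2.0 / current) / 2.0; // Newton step for sqrt(2)
            if (hasConverged.test(current, next)) {
                return next;                                // user-defined early stop
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // User-defined criterion: stop when consecutive solutions differ by < 1e-9.
        double root = iterate(1.0, 100, (oldS, newS) -> Math.abs(oldS - newS) < 1e-9);
        System.out.println(Math.abs(root * root - 2.0) < 1e-9); // true
    }
}
```

The "return an empty data set to terminate" convention from the issue maps here to the predicate returning true: either way, the algorithm writer owns the loop and the user only supplies the stopping rule.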





[jira] [Updated] (FLINK-2367) "flink-xx-jobmanager-linux-3lsu.log" file can't be auto-recovered/detected after mistaken delete

2015-07-16 Thread chenliang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenliang updated FLINK-2367:
-
Issue Type: Improvement  (was: Bug)

 "flink-xx-jobmanager-linux-3lsu.log" file can't be auto-recovered/detected 
 after mistaken delete
 -

 Key: FLINK-2367
 URL: https://issues.apache.org/jira/browse/FLINK-2367
 Project: Flink
  Issue Type: Improvement
  Components: JobManager
Affects Versions: 0.9
 Environment: Linux
Reporter: chenliang
Priority: Minor
  Labels: reliability
 Fix For: 0.9.0


 To check whether the system is adequately reliable, testers often deliberately 
 perform delete operations.
 Steps:
 1. Go to flink\build-target\log.
 2. Delete the "flink-xx-jobmanager-linux-3lsu.log" file.
 3. Run jobs that write log info; meanwhile the system does not give any 
 error even though the log info cannot be written correctly.
 4. When some jobs fail, go to the log file to find the reason; 
 the log file cannot be found. 
 The Job Manager must be restarted to regenerate the log file before jobs can 
 continue to run.





[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2015-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629575#comment-14629575
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-121928501
  
Will do, once I have some free time. Currently I have to finish the high
availability support. I'll try to hurry up.

On Thu, Jul 16, 2015 at 12:50 PM, Sachin Goel notificati...@github.com
wrote:

 Okay. @tillrohrmann https://github.com/tillrohrmann, can you review
 this?

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/flink/pull/757#issuecomment-121924666.




 Add Initialization schemes for K-means clustering
 -

 Key: FLINK-2131
 URL: https://issues.apache.org/jira/browse/FLINK-2131
 Project: Flink
  Issue Type: Task
  Components: Machine Learning Library
Reporter: Sachin Goel
Assignee: Sachin Goel

 The Lloyd's [KMeans] algorithm takes initial centroids as its input. However, 
 in case the user doesn't provide the initial centers, they may ask for a 
 particular initialization scheme to be followed. The most commonly used are 
 these:
 1. Random initialization: Self-explanatory
 2. kmeans++ initialization: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
 3. kmeans|| : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
 For very large data sets, or for large values of k, the kmeans|| method is 
 preferred as it provides the same approximation guarantees as kmeans++ while 
 requiring fewer passes over the input data.
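For reference, the D²-weighted sampling at the heart of kmeans++ (scheme 2 above) can be sketched in a few lines. This is an illustrative 1-D toy under stated assumptions, not the Flink or PR implementation: each new center is drawn with probability proportional to the squared distance to the nearest center chosen so far.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Compact k-means++ seeding on 1-D points.
public class KMeansPlusPlusDemo {

    public static List<Double> seed(double[] points, int k, Random rnd) {
        List<Double> centers = new ArrayList<>();
        centers.add(points[rnd.nextInt(points.length)]); // first center: uniform
        while (centers.size() < k) {
            // Squared distance of each point to its nearest chosen center.
            double[] d2 = new double[points.length];
            double total = 0;
            for (int i = 0; i < points.length; i++) {
                double best = Double.MAX_VALUE;
                for (double c : centers) {
                    best = Math.min(best, (points[i] - c) * (points[i] - c));
                }
                d2[i] = best;
                total += best;
            }
            // Sample an index with probability proportional to d2[i].
            double r = rnd.nextDouble() * total;
            int idx = 0;
            double acc = d2[0];
            while (acc < r && idx < points.length - 1) {
                idx++;
                acc += d2[idx];
            }
            centers.add(points[idx]);
        }
        return centers;
    }

    public static void main(String[] args) {
        double[] pts = {0.0, 0.1, 0.2, 10.0, 10.1, 10.2};
        List<Double> centers = seed(pts, 2, new Random(7));
        System.out.println(centers.size()); // 2
    }
}
```

kmeans|| differs mainly in sampling an oversampled batch of candidates per pass and then reclustering them, which is what makes it cheaper than kmeans++ on large distributed data sets.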





[jira] [Commented] (FLINK-2368) Add a convergence criteria

2015-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629757#comment-14629757
 ] 

ASF GitHub Bot commented on FLINK-2368:
---

GitHub user sachingoel0101 opened a pull request:

https://github.com/apache/flink/pull/918

[FLINK-2368][ml]Adds convergence criteria [WIP]

Adds a convergence criteria class which allows the user to decide whether 
they want to terminate training at any point, based on the solutions before and 
after an iteration. 

Adding the first example to the SVM trainer.[Not finished yet].

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachingoel0101/flink convergence

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/918.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #918


commit 656010cd1ef7df5bffafadb6a8be9bd7db715533
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-07-16T13:50:57Z

Initial framework design




 Add a convergence criteria
 --

 Key: FLINK-2368
 URL: https://issues.apache.org/jira/browse/FLINK-2368
 Project: Flink
  Issue Type: Bug
  Components: Machine Learning Library
Reporter: Sachin Goel
Assignee: Sachin Goel

 Add a convergence criterion for the Machine Learning library. If an algorithm 
 is iteration based, it should have a setConvergenceCriteria function which 
 checks for convergence at the end of every iteration based on a user-defined 
 criterion. 
 The algorithm writer will provide the user with the training data set, the solution 
 before an iteration, and the solution after the iteration; the user returns 
 an empty data set in case the training needs to be terminated.





[GitHub] flink pull request: [FLINK-2368][ml]Adds convergence criteria [WIP...

2015-07-16 Thread sachingoel0101
Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/918#discussion_r34788837
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/classification/SVM.scala
 ---
@@ -382,7 +402,11 @@ object SVM{
 
     // calculate the new weight vector by adding the weight vector updates to the weight
     // vector value
-    weights.union(weightedDeltaWs).reduce { _ + _ }
+    val newWeights = weights.union(weightedDeltaWs).reduce { _ + _ }
+    if(convergenceDefined){
+      converger.get.construct(input,weights,newWeights)
--- End diff --

@tillrohrmann , could you take a look here? I am unable to work out the 
problem here.




[jira] [Commented] (FLINK-2368) Add a convergence criteria

2015-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629761#comment-14629761
 ] 

ASF GitHub Bot commented on FLINK-2368:
---

Github user sachingoel0101 commented on a diff in the pull request:

https://github.com/apache/flink/pull/918#discussion_r34788837
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/classification/SVM.scala
 ---
@@ -382,7 +402,11 @@ object SVM{
 
     // calculate the new weight vector by adding the weight vector updates to the weight
     // vector value
-    weights.union(weightedDeltaWs).reduce { _ + _ }
+    val newWeights = weights.union(weightedDeltaWs).reduce { _ + _ }
+    if(convergenceDefined){
+      converger.get.construct(input,weights,newWeights)
--- End diff --

@tillrohrmann , could you take a look here? I am unable to work out the 
problem here.


 Add a convergence criteria
 --

 Key: FLINK-2368
 URL: https://issues.apache.org/jira/browse/FLINK-2368
 Project: Flink
  Issue Type: Bug
  Components: Machine Learning Library
Reporter: Sachin Goel
Assignee: Sachin Goel

 Add a convergence criterion for the Machine Learning library. If an algorithm 
 is iteration based, it should have a setConvergenceCriteria function which 
 checks for convergence at the end of every iteration based on a user-defined 
 criterion. 
 The algorithm writer will provide the user with the training data set, the solution 
 before an iteration, and the solution after the iteration; the user returns 
 an empty data set in case the training needs to be terminated.





[GitHub] flink pull request: [FLINK-2368][ml]Adds convergence criteria [WIP...

2015-07-16 Thread sachingoel0101
GitHub user sachingoel0101 opened a pull request:

https://github.com/apache/flink/pull/918

[FLINK-2368][ml]Adds convergence criteria [WIP]

Adds a convergence criteria class which allows the user to decide whether 
they want to terminate training at any point, based on the solutions before and 
after an iteration. 

Adding the first example to the SVM trainer.[Not finished yet].

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachingoel0101/flink convergence

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/918.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #918


commit 656010cd1ef7df5bffafadb6a8be9bd7db715533
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-07-16T13:50:57Z

Initial framework design






[jira] [Created] (FLINK-2369) On Windows, in testFailingSortingDataSinkTask the temp file is not removed

2015-07-16 Thread Gabor Gevay (JIRA)
Gabor Gevay created FLINK-2369:
--

 Summary: On Windows, in testFailingSortingDataSinkTask the temp 
file is not removed
 Key: FLINK-2369
 URL: https://issues.apache.org/jira/browse/FLINK-2369
 Project: Flink
  Issue Type: Bug
  Components: Distributed Runtime
 Environment: Windows 7 64-bit, JDK 8u51
Reporter: Gabor Gevay
Assignee: Gabor Gevay
Priority: Minor


The test fails with the assert at the very end ("Temp output file has not been 
removed"). This happens because FileOutputFormat.tryCleanupOnError can't delete 
the file, because it is still open (note that Linux happily deletes open 
files).

A fix would be to have the `this.format.close();` not just in the finally block 
(in DataSinkTask.invoke), but also before the tryCleanupOnError call (around 
line 217).
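The proposed ordering can be illustrated with plain java.io/java.nio stand-ins (the output stream below merely mimics the role of `this.format` in DataSinkTask; none of this is Flink code):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Shows why the format must be closed before cleanup: on Windows,
// deleting a file that still has an open handle fails, while Linux allows it.
public class CleanupOrderDemo {

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sink-output", ".tmp");
        OutputStream format = Files.newOutputStream(tmp); // stands in for this.format

        try {
            format.write(42);
            throw new IOException("record serialization failed"); // simulated task failure
        } catch (IOException e) {
            // Proposed fix: close the format *before* attempting cleanup,
            // so the delete below cannot fail due to an open handle.
            format.close();
            Files.deleteIfExists(tmp); // stands in for tryCleanupOnError(...)
        }

        System.out.println(Files.exists(tmp)); // false
    }
}
```

Keeping the close() in the finally block as well remains safe, since closing an already-closed stream is a no-op.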





[jira] [Created] (FLINK-2367) "flink-xx-jobmanager-linux-3lsu.log" file can't be auto-recovered/detected after mistaken delete

2015-07-16 Thread chenliang (JIRA)
chenliang created FLINK-2367:


 Summary: "flink-xx-jobmanager-linux-3lsu.log" file can't be auto-recovered/detected 
after mistaken delete
 Key: FLINK-2367
 URL: https://issues.apache.org/jira/browse/FLINK-2367
 Project: Flink
  Issue Type: Bug
  Components: JobManager
Affects Versions: 0.9
 Environment: Linux
Reporter: chenliang
Priority: Minor
 Fix For: 0.9.0


To check whether the system is adequately reliable, testers often deliberately 
perform delete operations.
Steps:
1. Go to flink\build-target\log.
2. Delete the "flink-xx-jobmanager-linux-3lsu.log" file.
3. Run jobs that write log info; meanwhile the system does not give any 
error even though the log info cannot be written correctly.
4. When some jobs fail, go to the log file to find the reason; 
the log file cannot be found. 
The Job Manager must be restarted to regenerate the log file before jobs can continue to run.





[jira] [Created] (FLINK-2370) Unify Client Code

2015-07-16 Thread Matthias J. Sax (JIRA)
Matthias J. Sax created FLINK-2370:
--

 Summary: Unify Client Code
 Key: FLINK-2370
 URL: https://issues.apache.org/jira/browse/FLINK-2370
 Project: Flink
  Issue Type: Improvement
Reporter: Matthias J. Sax
Assignee: Matthias J. Sax


Right now, client code that sends commands to the JobManager is spread over multiple 
classes, resulting in code duplication. Furthermore, each client offers a 
different sub-set of the available commands.

The idea is to use Client.java as a central class that provides all available 
commands the JobManager can handle. All other clients use Client.java 
internally.






[GitHub] flink pull request: [FLINK-2111] Add stop signal to cleanly shut...

2015-07-16 Thread mjsax
Github user mjsax commented on the pull request:

https://github.com/apache/flink/pull/750#issuecomment-122071560
  
Added fix for https://issues.apache.org/jira/browse/FLINK-2338




[jira] [Updated] (FLINK-2352) [Graph Visualization] Integrate Gelly with Gephi

2015-07-16 Thread Andra Lungu (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andra Lungu updated FLINK-2352:
---
Assignee: Shivani Ghatge

 [Graph Visualization] Integrate Gelly with Gephi
 

 Key: FLINK-2352
 URL: https://issues.apache.org/jira/browse/FLINK-2352
 Project: Flink
  Issue Type: New Feature
  Components: Gelly
Affects Versions: 0.10
Reporter: Andra Lungu
Assignee: Shivani Ghatge

 This integration will allow users to see the real-time progress of their 
 graph. They could also visually verify results for clustering algorithms, for 
 example. Gephi is free/open-source and provides support for all types of 
 networks, including dynamic and hierarchical graphs. 
 A first step would be to add the Gephi Toolkit to the pom.xml.
 https://github.com/gephi/gephi-toolkit
 Afterwards, a GraphBuilder similar to this one
 https://github.com/palmerabollo/test-twitter-graph/blob/master/src/main/java/es/guido/twitter/graph/GraphBuilder.java
 can be implemented. 





[jira] [Commented] (FLINK-2361) flatMap + distinct gives erroneous results for big data sets

2015-07-16 Thread Andra Lungu (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630360#comment-14630360
 ] 

Andra Lungu commented on FLINK-2361:


Hi, 

I tweaked the code to look like this:
DataSet<Edge<String, NullValue>> edges = getEdgesDataSet(env);

DataSet<Vertex<String, Long>> vertices = edges.flatMap(
        new FlatMapFunction<Edge<String, NullValue>, Vertex<String, Long>>() {
    @Override
    public void flatMap(Edge<String, NullValue> edge,
            Collector<Vertex<String, Long>> collector) throws Exception {
        collector.collect(new Vertex<String, Long>(edge.getSource(),
                Long.parseLong(edge.getSource())));
        collector.collect(new Vertex<String, Long>(edge.getTarget(),
                Long.parseLong(edge.getTarget())));
    }
}).distinct();

if (fileOutput) {
    vertices.writeAsCsv(vertexInputPath, "\n", ",");
    env.execute();
}

DataSet<Vertex<String, Long>> rereadVertices = env.readCsvFile(vertexInputPath)
        .fieldDelimiter(",").lineDelimiter("\n").ignoreComments("#")
        .types(String.class, Long.class)
        .map(new MapFunction<Tuple2<String, Long>, Vertex<String, Long>>() {
            @Override
            public Vertex<String, Long> map(Tuple2<String, Long> tuple2) throws Exception {
                return new Vertex<String, Long>(tuple2.f0, tuple2.f1);
            }
        });

Graph<String, Long, NullValue> graph = Graph.fromDataSet(rereadVertices, edges, env);

which is not what I would normally do, BTW... 

and I still get:
Caused by: java.lang.Exception: Target vertex '124518874' does not exist!.
at 
org.apache.flink.graph.spargel.VertexCentricIteration$VertexUpdateUdfSimpleVV.coGroup(VertexCentricIteration.java:300)
at 
org.apache.flink.runtime.operators.CoGroupWithSolutionSetSecondDriver.run(CoGroupWithSolutionSetSecondDriver.java:220)
at 
org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
at 
org.apache.flink.runtime.iterative.task.AbstractIterativePactTask.run(AbstractIterativePactTask.java:139)
at 
org.apache.flink.runtime.iterative.task.IterationTailPactTask.run(IterationTailPactTask.java:107)
at 
org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:722)

Lucky day :| At least it's missing a different vertex now...

 flatMap + distinct gives erroneous results for big data sets
 

 Key: FLINK-2361
 URL: https://issues.apache.org/jira/browse/FLINK-2361
 Project: Flink
  Issue Type: Bug
  Components: Gelly
Affects Versions: 0.10
Reporter: Andra Lungu

 When running the simple Connected Components algorithm (currently in Gelly) 
 on the twitter follower graph, with 1, 100 or 1 iterations, I get the 
 following error:
 Caused by: java.lang.Exception: Target vertex '657282846' does not exist!.
   at 
 org.apache.flink.graph.spargel.VertexCentricIteration$VertexUpdateUdfSimpleVV.coGroup(VertexCentricIteration.java:300)
   at 
 org.apache.flink.runtime.operators.CoGroupWithSolutionSetSecondDriver.run(CoGroupWithSolutionSetSecondDriver.java:220)
   at 
 org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
   at 
 org.apache.flink.runtime.iterative.task.AbstractIterativePactTask.run(AbstractIterativePactTask.java:139)
   at 
 org.apache.flink.runtime.iterative.task.IterationTailPactTask.run(IterationTailPactTask.java:107)
   at 
 org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
   at java.lang.Thread.run(Thread.java:722)
 Now this is very bizarre, as the DataSet of vertices is produced from the 
 DataSet of edges... which means there cannot be an edge with an invalid 
 target id... The method calls flatMap to isolate the src and trg ids and 
 distinct to ensure their uniqueness.  
 The algorithm works fine for smaller data sets... 





[jira] [Commented] (FLINK-2111) Add stop signal to cleanly shutdown streaming jobs

2015-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630233#comment-14630233
 ] 

ASF GitHub Bot commented on FLINK-2111:
---

Github user mjsax commented on the pull request:

https://github.com/apache/flink/pull/750#issuecomment-122071560
  
Added fix for https://issues.apache.org/jira/browse/FLINK-2338


 Add stop signal to cleanly shutdown streaming jobs
 

 Key: FLINK-2111
 URL: https://issues.apache.org/jira/browse/FLINK-2111
 Project: Flink
  Issue Type: Improvement
  Components: Distributed Runtime, JobManager, Local Runtime, 
 Streaming, TaskManager, Webfrontend
Reporter: Matthias J. Sax
Assignee: Matthias J. Sax
Priority: Minor

 Currently, streaming jobs can only be stopped using the cancel command, which is 
 a hard stop with no clean shutdown.
 The newly introduced stop signal will only affect streaming source tasks, 
 such that the sources can stop emitting data and shut down cleanly, resulting 
 in a clean shutdown of the whole streaming job.
 This feature is a prerequisite for 
 https://issues.apache.org/jira/browse/FLINK-1929





[GitHub] flink pull request: Update README.md

2015-07-16 Thread chenliag613
Github user chenliag613 closed the pull request at:

https://github.com/apache/flink/pull/912




[jira] [Commented] (FLINK-2366) HA Without ZooKeeper

2015-07-16 Thread Suminda Dharmasena (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630814#comment-14630814
 ] 

Suminda Dharmasena commented on FLINK-2366:
---

In that case, can you add the ability for Flink to start up and configure the ZK 
instance internally? Flink could then be used as if it were a library, without having to 
do any other deployments. Also, in this case the JobManager would be started 
through the API, not by running a batch file.

 HA Without ZooKeeper
 

 Key: FLINK-2366
 URL: https://issues.apache.org/jira/browse/FLINK-2366
 Project: Flink
  Issue Type: Improvement
Reporter: Suminda Dharmasena
Priority: Minor

 Please provide a way to do HA without having to use ZK.





[GitHub] flink pull request: Update README.md

2015-07-16 Thread chenliag613
Github user chenliag613 commented on the pull request:

https://github.com/apache/flink/pull/912#issuecomment-122147385
  
Thanks @chiwanpark @mxm .




[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation

2015-07-16 Thread fhueske
Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/905#issuecomment-122121617
  
No worries :-)
You better use a dedicated branch for your next PR instead of the master 
branch.

I'll go ahead and merge this PR now. Thanks for the contribution!




[jira] [Commented] (FLINK-1963) Improve distinct() transformation

2015-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630455#comment-14630455
 ] 

ASF GitHub Bot commented on FLINK-1963:
---

Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/905#issuecomment-122121617
  
No worries :-)
You better use a dedicated branch for your next PR instead of the master 
branch.

I'll go ahead and merge this PR now. Thanks for the contribution!


 Improve distinct() transformation
 -

 Key: FLINK-1963
 URL: https://issues.apache.org/jira/browse/FLINK-1963
 Project: Flink
  Issue Type: Improvement
  Components: Java API, Scala API
Affects Versions: 0.9
Reporter: Fabian Hueske
Assignee: pietro pinoli
Priority: Minor
  Labels: starter
 Fix For: 0.9


 The `distinct()` transformation is a bit limited right now with respect to 
 processing atomic key types:
 - `distinct(String ...)` works only for composite data types (POJO, tuple), 
 but wildcard expression should also be supported for atomic key types
 - `distinct()` only works for composite types, but should also work for 
 atomic key types
 - `distinct(KeySelector)` is the most generic one, but not very handy to use
 - `distinct(int ...)` works only for Tuple data types (which is fine)
 Fixing this should be rather easy.
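The intended semantics for atomic key types are simple element-wise deduplication, mirrored here with plain Java streams rather than a running Flink job (a `DataSet<Integer>.distinct()` call would behave analogously, treating each whole element as its own key with no field expression needed):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrates the semantics distinct() gains for atomic key types:
// each element is its own key, so duplicates collapse to one occurrence.
public class AtomicDistinctDemo {
    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 1, 2, 2, 3, 4, 5, 5);
        List<Integer> result = input.stream()
                .distinct()                       // element-wise deduplication
                .sorted()                         // deterministic output order
                .collect(Collectors.toList());
        System.out.println(result); // [1, 2, 3, 4, 5]
    }
}
```

The sorted() call is only for a stable printout; in Flink, as in streams, distinct() itself gives no ordering guarantee across parallel partitions.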





[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation

2015-07-16 Thread fhueske
Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/905#issuecomment-122132484
  
Merged, thanks for the contribution @pp86 !




[jira] [Commented] (FLINK-1963) Improve distinct() transformation

2015-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630553#comment-14630553
 ] 

ASF GitHub Bot commented on FLINK-1963:
---

Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/905#issuecomment-122132484
  
Merged, thanks for the contribution @pp86 !




[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation

2015-07-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/905




[jira] [Commented] (FLINK-1963) Improve distinct() transformation

2015-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630551#comment-14630551
 ] 

ASF GitHub Bot commented on FLINK-1963:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/905




[jira] [Commented] (FLINK-1963) Improve distinct() transformation

2015-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630421#comment-14630421
 ] 

ASF GitHub Bot commented on FLINK-1963:
---

Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/905#issuecomment-122117624
  
Hi @pp86, are you still working on this PR? 
I would suggest opening a new PR for the documentation (FLINK-2362).




[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation

2015-07-16 Thread hsaputra
Github user hsaputra commented on a diff in the pull request:

https://github.com/apache/flink/pull/905#discussion_r34847588
  
--- Diff: 
flink-tests/src/test/java/org/apache/flink/test/javaApiOperators/DistinctITCase.java
 ---
@@ -274,4 +277,45 @@ public Integer map(POJO value) throws Exception {
return (int) value.nestedPojo.longNumber;
}
}
+
+   @Test
+   public void testCorrectnessOfDistinctOnAtomic() throws Exception {
+   /*
+   * check correctness of distinct on Integers
+   */
+
+   final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
+   DataSet<Integer> ds = CollectionDataSets.getIntegerDataSet(env);
+   DataSet<Integer> reduceDs = ds.distinct();
+
+   List<Integer> result = reduceDs.collect();
+
+   String expected = "1\n2\n3\n4\n5";
+
+   compareResultAsText(result, expected);
+   }
+
+   @Test
+   public void testCorrectnessOfDistinctOnAtomicWithSelectAllChar() throws Exception {
+   /*
--- End diff --

Looks like the comment block is misaligned.




[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation

2015-07-16 Thread fhueske
Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/905#issuecomment-122131578
  
Good catch @hsaputra :-)
I'll fix it before merging




[jira] [Commented] (FLINK-1963) Improve distinct() transformation

2015-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630540#comment-14630540
 ] 

ASF GitHub Bot commented on FLINK-1963:
---

Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/905#issuecomment-122131578
  
Good catch @hsaputra :-)
I'll fix it before merging




[jira] [Commented] (FLINK-685) Add support for semi-joins

2015-07-16 Thread pietro pinoli (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630528#comment-14630528
 ] 

pietro pinoli commented on FLINK-685:
-

Dear [~fhueske], are you (the community) still interested in this?

I am developing a project for genomics which relies on Flink, and having such a 
feature may help us a lot.

I would love to take it, but I guess I'll need some help from you (or someone 
else) to get started :)

Let me know.
Thanks,
Pietro.



 Add support for semi-joins
 --

 Key: FLINK-685
 URL: https://issues.apache.org/jira/browse/FLINK-685
 Project: Flink
  Issue Type: New Feature
Reporter: GitHub Import
Priority: Minor
  Labels: github-import
 Fix For: pre-apache


 A semi-join is basically a join filter. One input is filtering and the 
 other one is filtered.
 A tuple of the filtered input is emitted exactly once if the filtering 
 input has one (ore more) tuples with matching join keys. That means that the 
 output of a semi-join has the same type as the filtered input and the 
 filtering input is completely discarded.
 In order to support a semi-join, we need to add an additional physical 
 execution strategy, that ensures, that a tuple of the filtered input is 
 emitted only once if the filtering input has more than one tuple with 
 matching keys. Furthermore, a couple of optimizations compared to standard 
 joins can be done such as storing only keys and not the full tuple of the 
 filtering input in a hash table.
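
 The description above can be sketched in plain Java (a hypothetical 
 `semiJoin` helper, not a Flink API): only the keys of the filtering input are 
 kept in a hash set, and each tuple of the filtered input is emitted at most 
 once, no matter how many matching tuples the filtering side has.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class SemiJoinDemo {

    // Hypothetical semi-join: emit each element of the filtered input once
    // if at least one filtering-side key matches. Storing only the keys
    // (not full tuples) mirrors the optimization suggested in the issue.
    static <T, K> List<T> semiJoin(List<T> filtered, List<K> filteringKeys,
                                   Function<T, K> keyOf) {
        Set<K> keys = new HashSet<>(filteringKeys); // duplicate keys collapse here
        List<T> out = new ArrayList<>();
        for (T t : filtered) {
            if (keys.contains(keyOf.apply(t))) {
                out.add(t); // output has the same type as the filtered input
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> orders = Arrays.asList("o1:cust1", "o2:cust2", "o3:cust1");
        // "cust1" appears twice on the filtering side, but each matching
        // order is still emitted exactly once.
        List<String> keys = Arrays.asList("cust1", "cust1");
        System.out.println(semiJoin(orders, keys, o -> o.split(":")[1]));
        // [o1:cust1, o3:cust1]
    }
}
```

 The filtering input is completely discarded from the output, as the issue 
 requires; a real runtime strategy would additionally exploit this to build a 
 compact key-only hash table.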
  Imported from GitHub 
 Url: https://github.com/stratosphere/stratosphere/issues/685
 Created by: [fhueske|https://github.com/fhueske]
 Labels: enhancement, java api, runtime, 
 Milestone: Release 0.6 (unplanned)
 Created at: Mon Apr 14 12:05:29 CEST 2014
 State: open





[GitHub] flink pull request: [FLINK-1963] Improve distinct() transformation

2015-07-16 Thread pp86
Github user pp86 commented on the pull request:

https://github.com/apache/flink/pull/905#issuecomment-122118573
  
Hi @fhueske,

I'm very sorry, this is the first time I've worked with GitHub, and I'm making a 
bit of a mess :|
But I think I finally succeeded in removing the unwanted documentation commits.

I am not working on the code any more.
Thanks




[jira] [Commented] (FLINK-1963) Improve distinct() transformation

2015-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630431#comment-14630431
 ] 

ASF GitHub Bot commented on FLINK-1963:
---

Github user pp86 commented on the pull request:

https://github.com/apache/flink/pull/905#issuecomment-122118573
  
Hi @fhueske,

I'm very sorry, this is the first time I've worked with GitHub, and I'm making a 
bit of a mess :|
But I think I finally succeeded in removing the unwanted documentation commits.

I am not working on the code any more.
Thanks

