[jira] [Updated] (FLINK-3801) Upgrade Joda-Time library to 2.9.3

2016-05-06 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated FLINK-3801:
--
Description: 
Currently joda-time 2.5 is used, which is quite old.

We should upgrade to 2.9.3

  was:
Currently joda-time 2.5 is used, which is quite old.


We should upgrade to 2.9.3


> Upgrade Joda-Time library to 2.9.3
> --
>
> Key: FLINK-3801
> URL: https://issues.apache.org/jira/browse/FLINK-3801
> Project: Flink
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
>
> Currently joda-time 2.5 is used, which is quite old.
> We should upgrade to 2.9.3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-3753) KillerWatchDog should not use kill on toKill thread

2016-05-06 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated FLINK-3753:
--
Description: 
{code}
// this is harsh, but this watchdog is a last resort
if (toKill.isAlive()) {
  toKill.stop();
}
{code}
stop() is deprecated.

See:
https://www.securecoding.cert.org/confluence/display/java/THI05-J.+Do+not+use+Thread.stop()+to+terminate+threads
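
A cooperative alternative is to request cancellation and wait with a bounded 
timeout; a minimal sketch (hypothetical, not the actual KillerWatchDog code):
{code}
// Sketch: interrupt the thread and give it bounded time to exit instead of
// calling the unsafe stop(), which releases monitors mid-operation and can
// leave shared state inconsistent.
void shutDownCooperatively(Thread toKill) throws InterruptedException {
  if (toKill.isAlive()) {
    toKill.interrupt();  // request cancellation cooperatively
    toKill.join(5000);   // wait up to 5 seconds for the thread to exit
  }
}
{code}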

  was:
{code}
// this is harsh, but this watchdog is a last resort
if (toKill.isAlive()) {
  toKill.stop();
}
{code}
stop() is deprecated.


See:
https://www.securecoding.cert.org/confluence/display/java/THI05-J.+Do+not+use+Thread.stop()+to+terminate+threads


> KillerWatchDog should not use kill on toKill thread
> ---
>
> Key: FLINK-3753
> URL: https://issues.apache.org/jira/browse/FLINK-3753
> Project: Flink
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Minor
>
> {code}
> // this is harsh, but this watchdog is a last resort
> if (toKill.isAlive()) {
>   toKill.stop();
> }
> {code}
> stop() is deprecated.
> See:
> https://www.securecoding.cert.org/confluence/display/java/THI05-J.+Do+not+use+Thread.stop()+to+terminate+threads



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3772) Graph algorithms for vertex and edge degree

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274657#comment-15274657
 ] 

ASF GitHub Bot commented on FLINK-3772:
---

Github user greghogan commented on a diff in the pull request:

https://github.com/apache/flink/pull/1901#discussion_r62384029
  
--- Diff: docs/apis/batch/libs/gelly.md ---
@@ -2067,7 +2067,92 @@ configuration.
 
   
 
-  TranslateGraphIds
+  degree.annotate.directed.VertexInDegree
+  
+Annotate vertices of a directed graph with the in-degree count.
--- End diff --

I am removing "count" from the descriptions since "degree" implies a count.


> Graph algorithms for vertex and edge degree
> ---
>
> Key: FLINK-3772
> URL: https://issues.apache.org/jira/browse/FLINK-3772
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Many graph algorithms require vertices or edges to be marked with the degree. 
> This ticket provides algorithms for annotating
> * vertex degree for undirected graphs
> * vertex out-, in-, and out- and in-degree for directed graphs
> * edge source, target, and source and target degree for undirected graphs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-3772] [gelly] Graph algorithms for vert...

2016-05-06 Thread greghogan
Github user greghogan commented on a diff in the pull request:

https://github.com/apache/flink/pull/1901#discussion_r62384029
  
--- Diff: docs/apis/batch/libs/gelly.md ---
@@ -2067,7 +2067,92 @@ configuration.
 
   
 
-  TranslateGraphIds
+  degree.annotate.directed.VertexInDegree
+  
+Annotate vertices of a directed graph with the in-degree count.
--- End diff --

I am removing "count" from the descriptions since "degree" implies a count.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-3857) Add reconnect attempt to Elasticsearch host

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274627#comment-15274627
 ] 

ASF GitHub Bot commented on FLINK-3857:
---

Github user sbcd90 commented on the pull request:

https://github.com/apache/flink/pull/1962#issuecomment-217540405
  
Hello @rmetzger ,

I have now added a test case to the `ElasticsearchSinkITCase.java` suite of tests. 
Could you kindly take a look?


> Add reconnect attempt to Elasticsearch host
> ---
>
> Key: FLINK-3857
> URL: https://issues.apache.org/jira/browse/FLINK-3857
> Project: Flink
>  Issue Type: Improvement
>  Components: Streaming Connectors
>Affects Versions: 1.1.0, 1.0.2
>Reporter: Fabian Hueske
>Assignee: Subhobrata Dey
>
> Currently, the connection to the Elasticsearch host is opened in 
> {{ElasticsearchSink.open()}}. In case the connection is lost (maybe due to a 
> changed DNS entry), the sink fails.
> I propose to catch the Exception for lost connections in the {{invoke()}} 
> method and try to re-open the connection for a configurable number of times 
> with a certain delay.
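
A minimal sketch of the proposed retry behavior ({{index()}} and 
{{reopenConnection()}} are hypothetical helpers, not the connector's actual code):
{code:java}
// Sketch: catch lost-connection failures in invoke() and retry a configurable
// number of times, re-opening the connection with a delay between attempts.
void indexWithRetry(Object element, int maxRetries, long delayMillis) throws Exception {
  for (int attempt = 0; ; attempt++) {
    try {
      index(element);  // hypothetical send-to-Elasticsearch call
      return;
    } catch (Exception lostConnection) {
      if (attempt >= maxRetries) {
        throw lostConnection;  // give up after the configured number of retries
      }
      Thread.sleep(delayMillis);  // configurable delay before the next attempt
      reopenConnection();         // hypothetical re-connect helper
    }
  }
}
{code}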



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-3857][Streaming Connectors]Add reconnec...

2016-05-06 Thread sbcd90
Github user sbcd90 commented on the pull request:

https://github.com/apache/flink/pull/1962#issuecomment-217540405
  
Hello @rmetzger ,

I have now added a test case to the `ElasticsearchSinkITCase.java` suite of tests. 
Could you kindly take a look?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-2044) Implementation of Gelly HITS Algorithm

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274561#comment-15274561
 ] 

ASF GitHub Bot commented on FLINK-2044:
---

Github user gallenvara commented on the pull request:

https://github.com/apache/flink/pull/1956#issuecomment-217527593
  
Hi @vasia, @greghogan,
- The code has been modified and now supports returning both the hub and 
authority scores as a `Tuple2`. The implementation runs one extra iteration at 
the end to normalize the hub values. With the scatter-gather version 
implemented, the GSA version is easy to write.
- As for a threshold, the algorithm would have to check between neighboring hub 
and authority iterations whether any vertices were updated. That is a little 
difficult, and in my view the `maxIteration` parameter plays a similar role to a 
threshold. 
- I will open another issue for the GSA version and relevant tests.


> Implementation of Gelly HITS Algorithm
> --
>
> Key: FLINK-2044
> URL: https://issues.apache.org/jira/browse/FLINK-2044
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Reporter: Ahamd Javid
>Assignee: GaoLun
>Priority: Minor
>
> Implementation of the HITS algorithm in the Gelly API using Java. The feature 
> branch can be found here: (https://github.com/JavidMayar/flink/commits/HITS)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (FLINK-3618) Rename abstract UDF classes in Scatter-Gather implementation

2016-05-06 Thread Greg Hogan (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-3618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Hogan reassigned FLINK-3618:
-

Assignee: Greg Hogan

> Rename abstract UDF classes in Scatter-Gather implementation
> 
>
> Key: FLINK-3618
> URL: https://issues.apache.org/jira/browse/FLINK-3618
> Project: Flink
>  Issue Type: Improvement
>  Components: Gelly
>Affects Versions: 1.1.0, 1.0.1
>Reporter: Martin Junghanns
>Assignee: Greg Hogan
>Priority: Minor
>
> We now offer three Vertex-centric computing abstractions:
> * Pregel
> * Gather-Sum-Apply
> * Scatter-Gather
> Each of these abstractions provides abstract classes that need to be 
> implemented by the user:
> * Pregel: {{ComputeFunction}}
> * GSA: {{GatherFunction}}, {{SumFunction}}, {{ApplyFunction}}
> * Scatter-Gather: {{MessagingFunction}}, {{VertexUpdateFunction}}
> In Pregel and GSA, the names of those functions follow the name of the 
> abstraction or the name suggested in the corresponding papers. For 
> consistency of the API, I propose to rename {{MessagingFunction}} to 
> {{ScatterFunction}} and {{VertexUpdateFunction}} to {{GatherFunction}}.
> Also for consistency, I would like to change the parameter order in 
> {{Graph.runScatterGatherIteration(VertexUpdateFunction f1, MessagingFunction 
> f2)}} to {{Graph.runScatterGatherIteration(ScatterFunction f1, GatherFunction 
> f2)}} (like in {{Graph.runGatherSumApplyIteration(...)}})
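
A sketch of the call site before and after the proposed rename (the user 
classes and {{maxIterations}} are placeholders):
{code:java}
// before: update function first, messaging function second
graph.runScatterGatherIteration(
    new MyVertexUpdateFunction(), new MyMessagingFunction(), maxIterations);

// after (proposed): scatter first, gather second, matching the
// Gather-Sum-Apply parameter order
graph.runScatterGatherIteration(
    new MyScatterFunction(), new MyGatherFunction(), maxIterations);
{code}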



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274400#comment-15274400
 ] 

ASF GitHub Bot commented on FLINK-1502:
---

Github user ankitcha commented on the pull request:

https://github.com/apache/flink/pull/1947#issuecomment-217507497
  
@rmetzger thanks for the response. I think I got what you explained, and I 
agree that using an index in configuration keys won't be a nice experience. 

But could we support a nested structure in the Flink conf? I am unsure about the 
scope of this change, so maybe it's a bad suggestion. But this is something 
that would really help me put our application into production, since we have 
to use multiple reporters as part of our infrastructure requirements. 


> Expose metrics to graphite, ganglia and JMX.
> 
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager, TaskManager
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Assignee: Chesnay Schepler
>Priority: Minor
> Fix For: pre-apache
>
>
> The metrics library makes it easy to expose collected metrics to other 
> systems such as Graphite, Ganglia, or Java's JVM (VisualVM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3877) Create TranslateFunction interface for Graph translators

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274331#comment-15274331
 ] 

ASF GitHub Bot commented on FLINK-3877:
---

GitHub user greghogan opened a pull request:

https://github.com/apache/flink/pull/1968

[FLINK-3877] [gelly] Create TranslateFunction interface for Graph 
translators

The TranslateFunction interface is similar to MapFunction but may be called 
multiple times before serialization.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/greghogan/flink 
3877_create_translatefunction_interface_for_graph_translators

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/1968.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1968


commit b1d838a276c9c0f2e6b79b34f924cc42c93c371d
Author: Greg Hogan 
Date:   2016-05-04T20:51:23Z

[FLINK-3877] [gelly] Create TranslateFunction interface for Graph 
translators

The TranslateFunction interface is similar to MapFunction but may be
called multiple times before serialization.




> Create TranslateFunction interface for Graph translators
> 
>
> Key: FLINK-3877
> URL: https://issues.apache.org/jira/browse/FLINK-3877
> Project: Flink
>  Issue Type: Bug
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
>Priority: Minor
> Fix For: 1.1.0
>
>
> I now recall why FLINK-3771 had a {{Translator}} interface with a 
> {{translate}} method taking a field for reuse: when we translate edge IDs the 
> translator must be called twice.
> {{TranslateFunction}} will be modeled after {{MapFunction}} and 
> {{RichTranslateFunction}} will be modeled after {{RichMapFunction}}.
> The unit test should have caught this, but I was reusing values between fields, 
> which meant it did not detect that values were overwritten.
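
A minimal sketch of what such an interface might look like, modeled on 
{{MapFunction}} (hypothetical; the merged interface may differ):
{code:java}
import java.io.Serializable;
import org.apache.flink.api.common.functions.Function;

// Sketch: like MapFunction, but taking a reuse object so the function can be
// invoked multiple times (e.g. once per edge endpoint) before serialization.
public interface TranslateFunction<T, O> extends Function, Serializable {
  O translate(T value, O reuse) throws Exception;
}
{code}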



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-06 Thread Greg Hogan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274307#comment-15274307
 ] 

Greg Hogan commented on FLINK-3879:
---

I think FLINK-2044 can test for convergence if the last three scores are stored 
for each vertex. That would also make it trivial to return both scores. So, 
yes, performance.
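
A minimal sketch of such a per-vertex convergence check (hypothetical, not 
FLINK-2044 code):
{code:java}
// Sketch: hub and authority updates alternate, so a vertex compares its new
// score against the stored score from the same phase two iterations earlier
// and is converged once the change falls below a threshold.
boolean hasConverged(double scoreTwoIterationsAgo, double currentScore, double epsilon) {
  return Math.abs(currentScore - scoreTwoIterationsAgo) < epsilon;
}
{code}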

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3880) Use ConcurrentHashMap for Accumulators

2016-05-06 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274298#comment-15274298
 ] 

Maximilian Michels commented on FLINK-3880:
---

You're right, the synchronized map is a bottleneck. Actually, the 
synchronization is not even necessary. In a regular Flink job, the map can only 
be accessed by one task at a time. Only if the user spawns additional threads 
could it be modified concurrently. In that case the user would have to take care 
of the synchronization (and, if not, would get a ConcurrentModificationException). 
So we can simply make it a normal map.

> Use ConcurrentHashMap for Accumulators
> --
>
> Key: FLINK-3880
> URL: https://issues.apache.org/jira/browse/FLINK-3880
> Project: Flink
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Ken Krugler
>Priority: Minor
>
> I was looking at improving DataSet performance - this is for a job created 
> using the Cascading-Flink planner for Cascading 3.1.
> While doing a quick "poor man's profiler" session with one of the TaskManager 
> processes, I noticed that many (most?) of the threads that were actually 
> running were in this state:
> {code:java}
> "DataSource (/working1/terms) (8/20)" daemon prio=10 tid=0x7f55673e0800 
> nid=0x666a runnable [0x7f556abcf000]
>java.lang.Thread.State: RUNNABLE
> at java.util.Collections$SynchronizedMap.get(Collections.java:2037)
> - locked <0x0006e73fe718> (a java.util.Collections$SynchronizedMap)
> at 
> org.apache.flink.api.common.functions.util.AbstractRuntimeUDFContext.getAccumulator(AbstractRuntimeUDFContext.java:162)
> at 
> org.apache.flink.api.common.functions.util.AbstractRuntimeUDFContext.getLongCounter(AbstractRuntimeUDFContext.java:113)
> at 
> com.dataartisans.flink.cascading.runtime.util.FlinkFlowProcess.getOrInitCounter(FlinkFlowProcess.java:245)
> at 
> com.dataartisans.flink.cascading.runtime.util.FlinkFlowProcess.increment(FlinkFlowProcess.java:128)
> at 
> com.dataartisans.flink.cascading.runtime.util.FlinkFlowProcess.increment(FlinkFlowProcess.java:122)
> at 
> cascading.tap.hadoop.util.MeasuredRecordReader.next(MeasuredRecordReader.java:65)
> at cascading.scheme.hadoop.SequenceFile.source(SequenceFile.java:97)
> at 
> cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:166)
> at 
> cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:139)
> at 
> com.dataartisans.flink.cascading.runtime.source.TapSourceStage.readNextRecord(TapSourceStage.java:70)
> at 
> com.dataartisans.flink.cascading.runtime.source.TapInputFormat.reachedEnd(TapInputFormat.java:175)
> at 
> org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:173)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> It looks like Cascading is asking Flink to increment a counter with each 
> Tuple read, and that in turn is often blocked on getting access to the 
> Accumulator object in a map. It looks like this is a SynchronizedMap, but 
> using a ConcurrentHashMap (for example) would reduce this contention.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (FLINK-3880) Use ConcurrentHashMap for Accumulators

2016-05-06 Thread Maximilian Michels (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maximilian Michels reassigned FLINK-3880:
-

Assignee: Maximilian Michels

> Use ConcurrentHashMap for Accumulators
> --
>
> Key: FLINK-3880
> URL: https://issues.apache.org/jira/browse/FLINK-3880
> Project: Flink
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Ken Krugler
>Assignee: Maximilian Michels
>Priority: Minor
>
> I was looking at improving DataSet performance - this is for a job created 
> using the Cascading-Flink planner for Cascading 3.1.
> While doing a quick "poor man's profiler" session with one of the TaskManager 
> processes, I noticed that many (most?) of the threads that were actually 
> running were in this state:
> {code:java}
> "DataSource (/working1/terms) (8/20)" daemon prio=10 tid=0x7f55673e0800 
> nid=0x666a runnable [0x7f556abcf000]
>java.lang.Thread.State: RUNNABLE
> at java.util.Collections$SynchronizedMap.get(Collections.java:2037)
> - locked <0x0006e73fe718> (a java.util.Collections$SynchronizedMap)
> at 
> org.apache.flink.api.common.functions.util.AbstractRuntimeUDFContext.getAccumulator(AbstractRuntimeUDFContext.java:162)
> at 
> org.apache.flink.api.common.functions.util.AbstractRuntimeUDFContext.getLongCounter(AbstractRuntimeUDFContext.java:113)
> at 
> com.dataartisans.flink.cascading.runtime.util.FlinkFlowProcess.getOrInitCounter(FlinkFlowProcess.java:245)
> at 
> com.dataartisans.flink.cascading.runtime.util.FlinkFlowProcess.increment(FlinkFlowProcess.java:128)
> at 
> com.dataartisans.flink.cascading.runtime.util.FlinkFlowProcess.increment(FlinkFlowProcess.java:122)
> at 
> cascading.tap.hadoop.util.MeasuredRecordReader.next(MeasuredRecordReader.java:65)
> at cascading.scheme.hadoop.SequenceFile.source(SequenceFile.java:97)
> at 
> cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:166)
> at 
> cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:139)
> at 
> com.dataartisans.flink.cascading.runtime.source.TapSourceStage.readNextRecord(TapSourceStage.java:70)
> at 
> com.dataartisans.flink.cascading.runtime.source.TapInputFormat.reachedEnd(TapInputFormat.java:175)
> at 
> org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:173)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> It looks like Cascading is asking Flink to increment a counter with each 
> Tuple read, and that in turn is often blocked on getting access to the 
> Accumulator object in a map. It looks like this is a SynchronizedMap, but 
> using a ConcurrentHashMap (for example) would reduce this contention.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-06 Thread Vasia Kalavri (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274270#comment-15274270
 ] 

Vasia Kalavri commented on FLINK-3879:
--

So, it provides the same functionality as FLINK-2044, but it's more efficient?

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1873) Distributed matrix implementation

2016-05-06 Thread Simone Robutti (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274265#comment-15274265
 ] 

Simone Robutti commented on FLINK-1873:
---

Hmm, ok. I think that on a block-partitioned matrix I will need to perform 
block-wise operations, so it makes sense to represent the blocks as Breeze 
matrices. 

Anyway, talking about Flink's implementations: why were they implemented in the 
first place if we must rely on Breeze for operations?

> Distributed matrix implementation
> -
>
> Key: FLINK-1873
> URL: https://issues.apache.org/jira/browse/FLINK-1873
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: liaoyuxi
>Assignee: Simone Robutti
>  Labels: ML
>
> It would help to implement machine learning algorithms more quickly and 
> concisely if Flink provided support for storing data and computation in a 
> distributed matrix. The design of the implementation is attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-3879) Native implementation of HITS algorithm

2016-05-06 Thread Greg Hogan (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Hogan updated FLINK-3879:
--
Description: 
Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is presented 
in [0] and described in [1].

"[HITS] is a very popular and effective algorithm to rank documents based on 
the link information among a set of documents. The algorithm presumes that a 
good hub is a document that points to many others, and a good authority is a 
document that many documents point to." 
[https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]

This implementation differs from FLINK-2044 by providing for convergence, 
outputting both hub and authority scores, and completing in half the number of 
iterations.

[0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
[1] https://en.wikipedia.org/wiki/HITS_algorithm

  was:
Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is presented 
in [0] and described in [1].

[0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
[1] https://en.wikipedia.org/wiki/HITS_algorithm


> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-06 Thread Vasia Kalavri (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274254#comment-15274254
 ] 

Vasia Kalavri commented on FLINK-3879:
--

Hi [~greghogan],
Can you please extend the description a bit? How is this algorithm different 
from what FLINK-2044 proposes, and what's the motivation for adding it to Gelly?
Thanks!

> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274225#comment-15274225
 ] 

ASF GitHub Bot commented on FLINK-3879:
---

GitHub user greghogan opened a pull request:

https://github.com/apache/flink/pull/1967

[FLINK-3879] [gelly] Native implementation of HITS algorithm



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/greghogan/flink 
3879_native_implementation_of_hits_algorithm

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/1967.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1967


commit 724a86b5b5e7e7a93392048d5842fe54df7c4bfe
Author: Greg Hogan 
Date:   2016-05-06T15:13:26Z

[FLINK-3879] [gelly] Native implementation of HITS algorithm




> Native implementation of HITS algorithm
> ---
>
> Key: FLINK-3879
> URL: https://issues.apache.org/jira/browse/FLINK-3879
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-3879] [gelly] Native implementation of ...

2016-05-06 Thread greghogan
GitHub user greghogan opened a pull request:

https://github.com/apache/flink/pull/1967

[FLINK-3879] [gelly] Native implementation of HITS algorithm



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/greghogan/flink 
3879_native_implementation_of_hits_algorithm

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/1967.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1967


commit 724a86b5b5e7e7a93392048d5842fe54df7c4bfe
Author: Greg Hogan 
Date:   2016-05-06T15:13:26Z

[FLINK-3879] [gelly] Native implementation of HITS algorithm




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (FLINK-3880) Use ConcurrentHashMap for Accumulators

2016-05-06 Thread Ken Krugler (JIRA)
Ken Krugler created FLINK-3880:
--

 Summary: Use ConcurrentHashMap for Accumulators
 Key: FLINK-3880
 URL: https://issues.apache.org/jira/browse/FLINK-3880
 Project: Flink
  Issue Type: Improvement
Affects Versions: 1.1.0
Reporter: Ken Krugler
Priority: Minor


I was looking at improving DataSet performance - this is for a job created 
using the Cascading-Flink planner for Cascading 3.1.

While doing a quick "poor man's profiler" session with one of the TaskManager 
processes, I noticed that many (most?) of the threads that were actually 
running were in this state:

{code:java}
"DataSource (/working1/terms) (8/20)" daemon prio=10 tid=0x7f55673e0800 
nid=0x666a runnable [0x7f556abcf000]
   java.lang.Thread.State: RUNNABLE
at java.util.Collections$SynchronizedMap.get(Collections.java:2037)
- locked <0x0006e73fe718> (a java.util.Collections$SynchronizedMap)
at 
org.apache.flink.api.common.functions.util.AbstractRuntimeUDFContext.getAccumulator(AbstractRuntimeUDFContext.java:162)
at 
org.apache.flink.api.common.functions.util.AbstractRuntimeUDFContext.getLongCounter(AbstractRuntimeUDFContext.java:113)
at 
com.dataartisans.flink.cascading.runtime.util.FlinkFlowProcess.getOrInitCounter(FlinkFlowProcess.java:245)
at 
com.dataartisans.flink.cascading.runtime.util.FlinkFlowProcess.increment(FlinkFlowProcess.java:128)
at 
com.dataartisans.flink.cascading.runtime.util.FlinkFlowProcess.increment(FlinkFlowProcess.java:122)
at 
cascading.tap.hadoop.util.MeasuredRecordReader.next(MeasuredRecordReader.java:65)
at cascading.scheme.hadoop.SequenceFile.source(SequenceFile.java:97)
at 
cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:166)
at 
cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:139)
at 
com.dataartisans.flink.cascading.runtime.source.TapSourceStage.readNextRecord(TapSourceStage.java:70)
at 
com.dataartisans.flink.cascading.runtime.source.TapInputFormat.reachedEnd(TapInputFormat.java:175)
at 
org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:173)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:745)
{code}

It looks like Cascading is asking Flink to increment a counter with each Tuple 
read, and that in turn is often blocked on getting access to the Accumulator 
object in a map. It looks like this is a SynchronizedMap, but using a 
ConcurrentHashMap (for example) would reduce this contention.
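
The proposed change is a drop-in swap of the map implementation; a minimal 
before/after sketch (placeholder value type, not the actual field declaration):
{code:java}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Before: every get() contends on the synchronized wrapper's single monitor.
Map<String, Object> synchronizedAccumulators =
    Collections.synchronizedMap(new HashMap<String, Object>());

// After: ConcurrentHashMap allows reads without locking, removing the hot spot.
Map<String, Object> concurrentAccumulators =
    new ConcurrentHashMap<String, Object>();
{code}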




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1873) Distributed matrix implementation

2016-05-06 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274229#comment-15274229
 ] 

Till Rohrmann commented on FLINK-1873:
--

I think it's good to use Flink's matrix and vector representations as long as 
you don't have to perform operations. For that you can convert Flink's 
primitives into Breeze's primitives. The conversion should almost come for 
free, because Flink uses the same underlying representation of dense and sparse 
matrices/vectors as Breeze. 

> Distributed matrix implementation
> -
>
> Key: FLINK-1873
> URL: https://issues.apache.org/jira/browse/FLINK-1873
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: liaoyuxi
>Assignee: Simone Robutti
>  Labels: ML
>
> It would help to implement machine learning algorithms more quickly and 
> concisely if Flink provided support for storing data and computation in a 
> distributed matrix. The design of the implementation is attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [Flink-3836] Add LongHistogram accumulator

2016-05-06 Thread mbode
GitHub user mbode opened a pull request:

https://github.com/apache/flink/pull/1966

[Flink-3836] Add LongHistogram accumulator

New accumulator `LongHistogram`; the `Histogram` accumulator now throws an 
`IllegalArgumentException` instead of letting the int overflow. 
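
For context, accumulators of this kind are registered in a rich function and 
updated per record; a minimal usage sketch with the existing `Histogram` (the 
proposed `LongHistogram` would be used the same way):
{code:java}
import org.apache.flink.api.common.accumulators.Histogram;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class LengthHistogram extends RichMapFunction<String, Integer> {
  private final Histogram histogram = new Histogram();

  @Override
  public void open(Configuration parameters) {
    // register under a name so the result can be read from the JobExecutionResult
    getRuntimeContext().addAccumulator("length-histogram", histogram);
  }

  @Override
  public Integer map(String value) {
    int length = value.length();
    histogram.add(length);  // counts occurrences per integer value
    return length;
  }
}
{code}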

- [x] General
  - The pull request references the related JIRA issue ("[FLINK-XXX] Jira 
title text")
  - The pull request addresses only one issue
  - Each commit in the PR has a meaningful commit message (including the 
JIRA id)

- [x] Documentation
  - Documentation has been added for new functionality
  - Old documentation affected by the pull request has been updated
  - JavaDoc for public methods has been added

- [x] Tests & Build
  - Functionality added by the pull request is covered by tests
  - `mvn clean verify` has been executed successfully locally or a Travis 
build has passed

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mbode/flink master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/1966.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1966


commit f457319481701a1234c9ea7d29da24f857ae4241
Author: Maximilian Bode 
Date:   2016-04-27T15:19:16Z

[Flink-3836] Add LongHistogram accumulator




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-1873) Distributed matrix implementation

2016-05-06 Thread Simone Robutti (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274177#comment-15274177
 ] 

Simone Robutti commented on FLINK-1873:
---

I began working right away on this issue. 

For now I'm focusing on an indexed row matrix format, but I will probably 
implement a partitioned format with the same operations, since some of them are 
more straightforward to perform that way. I will write conversions from one 
format to the other.

For now I'm just initializing the distributed data structure and writing 
conversions to local formats (COO, Sparse, Dense). I'm doing everything with 
the standards of the local linear algebra package (indices as Int, values as 
Double, same names for methods, and so on). I'm also working with Flink's 
implementations of all these classes. Is that ok, or should I go directly to 
Breeze's implementations?

Then I will start thinking about common operations (multiplication, dot 
product, SVD (?), A^T A, and so on).

> Distributed matrix implementation
> -
>
> Key: FLINK-1873
> URL: https://issues.apache.org/jira/browse/FLINK-1873
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: liaoyuxi
>Assignee: Simone Robutti
>  Labels: ML
>
> It would help to implement machine learning algorithms more quickly and 
> concisely if Flink provided support for storing data and computation in a 
> distributed matrix. The design of the implementation is attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-3879) Native implementation of HITS algorithm

2016-05-06 Thread Greg Hogan (JIRA)
Greg Hogan created FLINK-3879:
-

 Summary: Native implementation of HITS algorithm
 Key: FLINK-3879
 URL: https://issues.apache.org/jira/browse/FLINK-3879
 Project: Flink
  Issue Type: New Feature
  Components: Gelly
Affects Versions: 1.1.0
Reporter: Greg Hogan
Assignee: Greg Hogan
 Fix For: 1.1.0


Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is presented 
in [0] and described in [1].

[0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
[1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3211) Add AWS Kinesis streaming connector

2016-05-06 Thread Robert Metzger (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274119#comment-15274119
 ] 

Robert Metzger commented on FLINK-3211:
---

[~tzulitai] Would you be okay with publishing the Kinesis connector as a GitHub 
project? Maybe in the dataArtisans GitHub organization, as "flink-connectors" or so?

> Add AWS Kinesis streaming connector
> ---
>
> Key: FLINK-3211
> URL: https://issues.apache.org/jira/browse/FLINK-3211
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming Connectors
>Affects Versions: 1.1.0
>Reporter: Tzu-Li (Gordon) Tai
>Assignee: Tzu-Li (Gordon) Tai
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> AWS Kinesis is a widely adopted message queue used by AWS users, much like a 
> cloud service version of Apache Kafka. Support for AWS Kinesis will be a 
> great addition to the handful of Flink's streaming connectors to external 
> systems and great outreach to the AWS community.
> AWS supports two different ways to consume Kinesis data: with the low-level 
> AWS SDK [1], or with the high-level KCL (Kinesis Client Library) [2]. AWS SDK 
> can be used to consume Kinesis data, including stream read beginning from a 
> specific offset (or "record sequence number" in Kinesis terminology). On the 
> other hand, AWS officially recommends using KCL, which offers a higher-level 
> of abstraction that also comes with checkpointing and failure recovery by 
> using a KCL-managed AWS DynamoDB "leash table" as the checkpoint state 
> storage.
> However, KCL is essentially a stream processing library that wraps all the 
> partition-to-task (or "shard" in Kinesis terminology) determination and 
> checkpointing to allow the user to focus only on streaming application logic. 
> This leads to the understanding that we can not use the KCL to implement the 
> Kinesis streaming connector if we are aiming for a deep integration of Flink 
> with Kinesis that provides exactly once guarantees (KCL promises only 
> at-least-once). Therefore, AWS SDK will be the way to go for the 
> implementation of this feature.
> With the ability to read from specific offsets, and also the fact that 
> Kinesis and Kafka share a lot of similarities, the basic principles of the 
> implementation of Flink's Kinesis streaming connector will very much resemble 
> the Kafka connector. We can basically follow the outlines described in 
> [~StephanEwen]'s description [3] and [~rmetzger]'s Kafka connector 
> implementation [4]. A few tweaks due to some Kinesis vs. Kafka 
> differences are described as follows:
> 1. While the Kafka connector can support reading from multiple topics, I 
> currently don't think this is a good idea for Kinesis streams (a Kinesis 
> Stream is logically equivalent to a Kafka topic). Kinesis streams can exist 
> in different AWS regions, and each Kinesis stream under the same AWS user 
> account may have completely independent access settings with different 
> authorization keys. Overall, a Kinesis stream feels like a much more 
> consolidated resource compared to Kafka topics. It would be great to hear 
> more thoughts on this part.
> 2. While Kafka has brokers that can hold multiple partitions, the only 
> partitioning abstraction for AWS Kinesis is "shards". Therefore, in contrast 
> to the Kafka connector having per broker connections where the connections 
> can handle multiple Kafka partitions, the Kinesis connector will only need to 
> have simple per shard connections.
> 3. Kinesis itself does not support committing offsets back to Kinesis. If we 
> were to implement this feature like the Kafka connector with Kafka / ZK to 
> sync outside view of progress, we probably could use ZK or DynamoDB like the 
> way KCL works. More thoughts on this part will be very helpful too.
> As for the Kinesis Sink, it should be possible to use the AWS KPL (Kinesis 
> Producer Library) [5]. However, for higher code consistency with the proposed 
> Kinesis Consumer, I think it will be better to stick with the AWS SDK for the 
> implementation. The implementation should be straightforward, being almost 
> if not completely the same as the Kafka sink.
> References:
> [1] 
> http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-sdk.html
> [2] 
> http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-kcl.html
> [3] 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Kinesis-Connector-td2872.html
> [4] http://data-artisans.com/kafka-flink-a-practical-how-to/
> [5] 
> http://docs.aws.amazon.com//kinesis/latest/dev/developing-producers-with-kpl.html#d0e4998



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3211) Add AWS Kinesis streaming connector

2016-05-06 Thread Robert Metzger (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274018#comment-15274018
 ] 

Robert Metzger commented on FLINK-3211:
---

I've spent quite some time with the Kinesis connector to get it working with 
Flink properly, but there is one big issue: the Kinesis producer library (and 
the client library) both depend on protobuf-java 2.6.1, while Flink uses 
protobuf 2.5 (mainly forced by Akka).

I tried the following approaches:
- Exclude the protobuf-java dependency from Akka --> doesn't work because it 
seems to be hardwired somehow
- Upgrade to Akka 2.4.0, which doesn't depend on protobuf anymore --> doesn't 
work because Akka 2.4 requires Java 8 (and Scala 2.11)
- Shade the Kinesis connector's protobuf dependency into the 
"flink-connector-kinesis" jar --> it works, but we cannot do it like this due 
to legal restrictions: the Amazon Software License restricts the use of the 
software to AWS services, which is not compatible with the ASF.

I see the following solutions:
- We merge the Kinesis code into master as an optional module that is not 
built for releases. People have to build it themselves.
- We host the Kinesis connector in a separate project (outside of the ASF).
- We ask Amazon to exclude the protobuf dependency.
- I try again using Akka without protobuf.

> Add AWS Kinesis streaming connector
> ---
>
> Key: FLINK-3211
> URL: https://issues.apache.org/jira/browse/FLINK-3211
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming Connectors
>Affects Versions: 1.1.0
>Reporter: Tzu-Li (Gordon) Tai
>Assignee: Tzu-Li (Gordon) Tai
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> AWS Kinesis is a widely adopted message queue used by AWS users, much like a 
> cloud service version of Apache Kafka. Support for AWS Kinesis will be a 
> great addition to the handful of Flink's streaming connectors to external 
> systems and great outreach to the AWS community.
> AWS supports two different ways to consume Kinesis data: with the low-level 
> AWS SDK [1], or with the high-level KCL (Kinesis Client Library) [2]. AWS SDK 
> can be used to consume Kinesis data, including stream read beginning from a 
> specific offset (or "record sequence number" in Kinesis terminology). On the 
> other hand, AWS officially recommends using KCL, which offers a higher-level 
> of abstraction that also comes with checkpointing and failure recovery by 
> using a KCL-managed AWS DynamoDB "leash table" as the checkpoint state 
> storage.
> However, KCL is essentially a stream processing library that wraps all the 
> partition-to-task (or "shard" in Kinesis terminology) determination and 
> checkpointing to allow the user to focus only on streaming application logic. 
> This leads to the understanding that we can not use the KCL to implement the 
> Kinesis streaming connector if we are aiming for a deep integration of Flink 
> with Kinesis that provides exactly once guarantees (KCL promises only 
> at-least-once). Therefore, AWS SDK will be the way to go for the 
> implementation of this feature.
> With the ability to read from specific offsets, and also the fact that 
> Kinesis and Kafka share a lot of similarities, the basic principles of the 
> implementation of Flink's Kinesis streaming connector will very much resemble 
> the Kafka connector. We can basically follow the outlines described in 
> [~StephanEwen]'s description [3] and [~rmetzger]'s Kafka connector 
> implementation [4]. A few tweaks due to some Kinesis vs. Kafka 
> differences are described as follows:
> 1. While the Kafka connector can support reading from multiple topics, I 
> currently don't think this is a good idea for Kinesis streams (a Kinesis 
> Stream is logically equivalent to a Kafka topic). Kinesis streams can exist 
> in different AWS regions, and each Kinesis stream under the same AWS user 
> account may have completely independent access settings with different 
> authorization keys. Overall, a Kinesis stream feels like a much more 
> consolidated resource compared to Kafka topics. It would be great to hear 
> more thoughts on this part.
> 2. While Kafka has brokers that can hold multiple partitions, the only 
> partitioning abstraction for AWS Kinesis is "shards". Therefore, in contrast 
> to the Kafka connector having per broker connections where the connections 
> can handle multiple Kafka partitions, the Kinesis connector will only need to 
> have simple per shard connections.
> 3. Kinesis itself does not support committing offsets back to Kinesis. If we 
> were to implement this feature like the Kafka connector with Kafka / ZK to 
> sync outside view of progress, we probably could use ZK or DynamoDB like the 
> way KCL works. More thoughts on this part will be very helpful too.
> As for 

[jira] [Commented] (FLINK-3772) Graph algorithms for vertex and edge degree

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273993#comment-15273993
 ] 

ASF GitHub Bot commented on FLINK-3772:
---

Github user greghogan commented on a diff in the pull request:

https://github.com/apache/flink/pull/1901#discussion_r62325628
  
--- Diff: docs/apis/batch/libs/gelly.md ---
@@ -2067,7 +2067,92 @@ configuration.
 
   
 
-  TranslateGraphIds
+  degree.annotate.directed.VertexInDegree
+  
+Annotate vertices of a directed graph with the in-degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> inDegree = graph
+  .run(new VertexInDegree()
+    .setIncludeZeroDegreeVertices(true));
+{% endhighlight %}
+  
+
+
+
+  degree.annotate.directed.VertexOutDegree
+  
+Annotate vertices of a directed graph with the out-degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> outDegree = graph
+  .run(new VertexOutDegree()
+    .setIncludeZeroDegreeVertices(true));
+{% endhighlight %}
+  
+
+
+
+  degree.annotate.directed.VertexDegreePair
+  
+Annotate vertices of a directed graph with both the out-degree and in-degree count.
+{% highlight java %}
+DataSet<Vertex<K, Tuple2<LongValue, LongValue>>> pairDegree = graph
+  .run(new VertexDegreePair()
+    .setIncludeZeroDegreeVertices(true));
+{% endhighlight %}
+  
+
+
+
+  degree.annotate.undirected.VertexDegree
+  
+Annotate vertices of an undirected graph with the degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> degree = graph
+  .run(new VertexDegree()
+    .setIncludeZeroDegreeVertices(true)
+    .setReduceOnTargetId(true));
+{% endhighlight %}
+  
+
+
+
+  degree.annotate.undirected.EdgeSourceDegree
+  
+Annotate edges of an undirected graph with degree of the source ID.
--- End diff --

An `Edge` is always directed but a `Graph` may be undirected if it contains 
a matching reverse of every edge.


> Graph algorithms for vertex and edge degree
> ---
>
> Key: FLINK-3772
> URL: https://issues.apache.org/jira/browse/FLINK-3772
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Many graph algorithms require vertices or edges to be marked with the degree. 
> This ticket provides algorithms for annotating
> * vertex degree for undirected graphs
> * vertex out-, in-, and out- and in-degree for directed graphs
> * edge source, target, and source and target degree for undirected graphs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-3772] [gelly] Graph algorithms for vert...

2016-05-06 Thread greghogan
Github user greghogan commented on a diff in the pull request:

https://github.com/apache/flink/pull/1901#discussion_r62325628
  
--- Diff: docs/apis/batch/libs/gelly.md ---
@@ -2067,7 +2067,92 @@ configuration.
 
   
 
-  TranslateGraphIds
+  degree.annotate.directed.VertexInDegree
+  
+Annotate vertices of a directed graph with the in-degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> inDegree = graph
+  .run(new VertexInDegree()
+    .setIncludeZeroDegreeVertices(true));
+{% endhighlight %}
+  
+
+
+
+  degree.annotate.directed.VertexOutDegree
+  
+Annotate vertices of a directed graph with the out-degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> outDegree = graph
+  .run(new VertexOutDegree()
+    .setIncludeZeroDegreeVertices(true));
+{% endhighlight %}
+  
+
+
+
+  degree.annotate.directed.VertexDegreePair
+  
+Annotate vertices of a directed graph with both the out-degree and in-degree count.
+{% highlight java %}
+DataSet<Vertex<K, Tuple2<LongValue, LongValue>>> pairDegree = graph
+  .run(new VertexDegreePair()
+    .setIncludeZeroDegreeVertices(true));
+{% endhighlight %}
+  
+
+
+
+  degree.annotate.undirected.VertexDegree
+  
+Annotate vertices of an undirected graph with the degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> degree = graph
+  .run(new VertexDegree()
+    .setIncludeZeroDegreeVertices(true)
+    .setReduceOnTargetId(true));
+{% endhighlight %}
+  
+
+
+
+  degree.annotate.undirected.EdgeSourceDegree
+  
+Annotate edges of an undirected graph with degree of the source ID.
--- End diff --

An `Edge` is always directed but a `Graph` may be undirected if it contains 
a matching reverse of every edge.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-3754) Add a validation phase before construct RelNode using TableAPI

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273989#comment-15273989
 ] 

ASF GitHub Bot commented on FLINK-3754:
---

Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/1958#issuecomment-217430461
  
I took a brief look at the PR. The overall structure of the validation 
looks good to me. I will do a more in-depth review in the next few days. 
@twalthr, can you have another look as well?
Thanks, Fabian


> Add a validation phase before construct RelNode using TableAPI
> --
>
> Key: FLINK-3754
> URL: https://issues.apache.org/jira/browse/FLINK-3754
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API
>Affects Versions: 1.0.0
>Reporter: Yijie Shen
>Assignee: Yijie Shen
>
> Unlike SQL string execution, which has a separate validation phase before 
> RelNode construction, the Table API lacks a counterpart, and validation is 
> scattered across many places.
> I suggest adding a single validation phase to detect problems as early as 
> possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-3754][Table]Add a validation phase befo...

2016-05-06 Thread fhueske
Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/1958#issuecomment-217430461
  
I took a brief look at the PR. The overall structure of the validation 
looks good to me. I will do a more in-depth review in the next few days. 
@twalthr, can you have another look as well?
Thanks, Fabian


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-3772) Graph algorithms for vertex and edge degree

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273983#comment-15273983
 ] 

ASF GitHub Bot commented on FLINK-3772:
---

Github user greghogan commented on the pull request:

https://github.com/apache/flink/pull/1901#issuecomment-21742
  
@vasia I had looked at one of the failed tests and it wasn't due to bad 
logic but rather to running out of some minimal number of memory buffers. This 
results, I expect, from updating the `Graph` API's degree functions to 
translate the output of the degree algorithms. I think it is best to leave the 
`Graph` API as-is for now.


> Graph algorithms for vertex and edge degree
> ---
>
> Key: FLINK-3772
> URL: https://issues.apache.org/jira/browse/FLINK-3772
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Many graph algorithms require vertices or edges to be marked with the degree. 
> This ticket provides algorithms for annotating
> * vertex degree for undirected graphs
> * vertex out-, in-, and out- and in-degree for directed graphs
> * edge source, target, and source and target degree for undirected graphs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-3772] [gelly] Graph algorithms for vert...

2016-05-06 Thread greghogan
Github user greghogan commented on the pull request:

https://github.com/apache/flink/pull/1901#issuecomment-21742
  
@vasia I had looked at one of the failed tests and it wasn't due to bad 
logic but rather to running out of some minimal number of memory buffers. This 
results, I expect, from updating the `Graph` API's degree functions to 
translate the output of the degree algorithms. I think it is best to leave the 
`Graph` API as-is for now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-3650) Add maxBy/minBy to Scala DataSet API

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273945#comment-15273945
 ] 

ASF GitHub Bot commented on FLINK-3650:
---

Github user ramkrish86 commented on a diff in the pull request:

https://github.com/apache/flink/pull/1856#discussion_r62320682
  
--- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SelectByMinFunction.java ---
@@ -72,8 +73,8 @@ public T reduce(T value1, T value2) throws Exception {
 		for (int position : fields) {
 			// Save position of compared key
 			// Get both values - both implement comparable
-			Comparable comparable1 = value1.getFieldNotNull(position);
-			Comparable comparable2 = value2.getFieldNotNull(position);
+			Comparable comparable1 = ((Tuple)value1).getFieldNotNull(position);
--- End diff --

Thanks for the tip, @tillrohrmann. Let me see how I can adapt the 
`CaseClassTypeInfo` for the SelectByMax/Min functions.


> Add maxBy/minBy to Scala DataSet API
> 
>
> Key: FLINK-3650
> URL: https://issues.apache.org/jira/browse/FLINK-3650
> Project: Flink
>  Issue Type: Improvement
>  Components: Java API, Scala API
>Affects Versions: 1.1.0
>Reporter: Till Rohrmann
>Assignee: ramkrishna.s.vasudevan
>
> The stable Java DataSet API contains the API calls {{maxBy}} and {{minBy}}. 
> These methods are not supported by the Scala DataSet API. These methods 
> should be added in order to have a consistent API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: FLINK-3650 Add maxBy/minBy to Scala DataSet AP...

2016-05-06 Thread ramkrish86
Github user ramkrish86 commented on a diff in the pull request:

https://github.com/apache/flink/pull/1856#discussion_r62320682
  
--- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SelectByMinFunction.java ---
@@ -72,8 +73,8 @@ public T reduce(T value1, T value2) throws Exception {
 		for (int position : fields) {
 			// Save position of compared key
 			// Get both values - both implement comparable
-			Comparable comparable1 = value1.getFieldNotNull(position);
-			Comparable comparable2 = value2.getFieldNotNull(position);
+			Comparable comparable1 = ((Tuple)value1).getFieldNotNull(position);
--- End diff --

Thanks for the tip, @tillrohrmann. Let me see how I can adapt the 
`CaseClassTypeInfo` for the SelectByMax/Min functions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-2044) Implementation of Gelly HITS Algorithm

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273941#comment-15273941
 ] 

ASF GitHub Bot commented on FLINK-2044:
---

Github user vasia commented on the pull request:

https://github.com/apache/flink/pull/1956#issuecomment-217419465
  
Hi @gallenvara, @greghogan,

- If it's possible to return both the hub and the authority value, I'd 
prefer that.

- GSA iterations allow setting the edge direction as Greg suggested. I'm 
not sure how much of a difference the combiner would make. Also, we've seen 
that for some graphs scatter-gather performs better. Personally, I would be 
fine with a first scatter-gather version of the algorithm. We can run some 
tests to see whether GSA would be faster later.

- I agree that edge values should be internally set to `NullValue` if not 
used.
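
A minimal sketch of the shape this could take; the `HITS` class and its 
signature are assumptions based on this thread, not merged API, and a 
`Graph<Long, Double, Double> graph` with unused edge values is assumed to 
be in scope:
{code}
// Drop the unused edge values before running, as discussed above.
Graph<Long, Double, NullValue> input = graph.mapEdges(
	new MapFunction<Edge<Long, Double>, NullValue>() {
		@Override
		public NullValue map(Edge<Long, Double> edge) {
			return NullValue.getInstance();
		}
	});

// Hypothetical result type: hub score in f0, authority score in f1.
DataSet<Vertex<Long, Tuple2<DoubleValue, DoubleValue>>> hubsAndAuthorities =
	input.run(new HITS<Long, Double, NullValue>(10)); // hypothetical class
{code}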


> Implementation of Gelly HITS Algorithm
> --
>
> Key: FLINK-2044
> URL: https://issues.apache.org/jira/browse/FLINK-2044
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Reporter: Ahamd Javid
>Assignee: GaoLun
>Priority: Minor
>
> Implementation of Hits Algorithm in Gelly API using Java. the feature branch 
> can be found here: (https://github.com/JavidMayar/flink/commits/HITS)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-2044] [gelly] Implementation of Gelly H...

2016-05-06 Thread vasia
Github user vasia commented on the pull request:

https://github.com/apache/flink/pull/1956#issuecomment-217419465
  
Hi @gallenvara, @greghogan,

- If it's possible to return both the hub and the authority value, I'd 
prefer that.

- GSA iterations allow setting the edge direction as Greg suggested. I'm 
not sure how much of a difference the combiner would make. Also, we've seen 
that for some graphs scatter-gather performs better. Personally, I would be 
fine with a first scatter-gather version of the algorithm. We can run some 
tests to see whether GSA would be faster later.

- I agree that edge values should be internally set to `NullValue` if not 
used.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-2926) Add a Strongly Connected Components Library Method

2016-05-06 Thread Vasia Kalavri (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273933#comment-15273933
 ] 

Vasia Kalavri commented on FLINK-2926:
--

Hi [~mliesenberg],
a delta iteration by default finishes when the workset is empty, but I don't see 
why it couldn't also support a custom convergence criterion. I thought this 
method was there already. In fact, the [iterations 
guide|https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/iterations.html]
 states that the delta iteration supports "Custom aggregator convergence", so 
that's odd. Can you please open an issue for that? Thanks!
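
For reference, the bulk `IterativeDataSet` does expose an aggregator-based 
convergence hook; a self-contained sketch for the bulk case (the aggregator 
name and the stopping rule are illustrative):
{code}
import org.apache.flink.api.common.aggregators.ConvergenceCriterion;
import org.apache.flink.api.common.aggregators.LongSumAggregator;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.types.LongValue;

public class CustomConvergenceSketch {

	// Converged once no element reported an update in the last superstep.
	public static class NoUpdatesCriterion implements ConvergenceCriterion<LongValue> {
		@Override
		public boolean isConverged(int iteration, LongValue updates) {
			return updates.getValue() == 0L;
		}
	}

	public static void main(String[] args) throws Exception {
		ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

		DataSet<Long> input = env.generateSequence(0, 100);
		IterativeDataSet<Long> iteration = input.iterate(1000);

		// Stop early when the "updates" aggregator sums to zero.
		iteration.registerAggregationConvergenceCriterion(
				"updates", new LongSumAggregator(), new NoUpdatesCriterion());

		DataSet<Long> step = iteration.map(new RichMapFunction<Long, Long>() {
			private LongSumAggregator updates;

			@Override
			public void open(Configuration parameters) {
				updates = getIterationRuntimeContext().getIterationAggregator("updates");
			}

			@Override
			public Long map(Long value) {
				if (value > 0) {
					updates.aggregate(1); // this element still changed
					return value - 1;
				}
				return value;
			}
		});

		iteration.closeWith(step).print();
	}
}
{code}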

> Add a Strongly Connected Components Library Method
> --
>
> Key: FLINK-2926
> URL: https://issues.apache.org/jira/browse/FLINK-2926
> Project: Flink
>  Issue Type: Task
>  Components: Gelly
>Affects Versions: 0.10.0
>Reporter: Andra Lungu
>Assignee: Martin Liesenberg
>Priority: Minor
>  Labels: requires-design-doc
>
> This algorithm operates in four main steps: 
> 1). Form the transposed graph (each vertex sends its id to its out neighbors 
> which form a transposedNeighbors set)
> 2). Trimming: every vertex which has only incoming or outgoing edges sets 
> colorID to its own value and becomes inactive. 
> 3). Forward traversal: 
>Start phase: propagate id to out neighbors 
>Rest phase: update the colorID with the minimum value seen 
> until convergence
> 4). Backward traversal: 
>  Start: if the vertex id is equal to its color id 
> propagate the value to transposedNeighbors
>  Rest: each vertex that receives a message equal to its 
> colorId will propagate its colorId to the transposed graph and becomes 
> inactive. 
> More info in section 3.1 of this paper: 
> http://ilpubs.stanford.edu:8090/1077/3/p535-salihoglu.pdf
> or in section 6 of this paper: http://www.vldb.org/pvldb/vol7/p1821-yan.pdf  
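
As an aside, step 1 of the outline above maps directly onto an existing 
Gelly primitive; a self-contained sketch:
{code}
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.graph.Edge;
import org.apache.flink.graph.Graph;
import org.apache.flink.types.NullValue;

public class TransposeSketch {
	public static void main(String[] args) throws Exception {
		ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

		Graph<Long, NullValue, NullValue> graph = Graph.fromDataSet(
			env.fromElements(
				new Edge<>(1L, 2L, NullValue.getInstance()),
				new Edge<>(2L, 3L, NullValue.getInstance())),
			env);

		// Step 1: the transposed graph flips every edge (u, v) to (v, u).
		Graph<Long, NullValue, NullValue> transposed = graph.reverse();

		transposed.getEdges().print();
	}
}
{code}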



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-3772] [gelly] Graph algorithms for vert...

2016-05-06 Thread vasia
Github user vasia commented on the pull request:

https://github.com/apache/flink/pull/1901#issuecomment-217403403
  
@greghogan, after your last commit, there are several gelly test cases 
failing on travis. Could you please take a look? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-3772) Graph algorithms for vertex and edge degree

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273868#comment-15273868
 ] 

ASF GitHub Bot commented on FLINK-3772:
---

Github user vasia commented on the pull request:

https://github.com/apache/flink/pull/1901#issuecomment-217403403
  
@greghogan, after your last commit, there are several gelly test cases 
failing on travis. Could you please take a look? Thanks!


> Graph algorithms for vertex and edge degree
> ---
>
> Key: FLINK-3772
> URL: https://issues.apache.org/jira/browse/FLINK-3772
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Many graph algorithms require vertices or edges to be marked with the degree. 
> This ticket provides algorithms for annotating
> * vertex degree for undirected graphs
> * vertex out-, in-, and out- and in-degree for directed graphs
> * edge source, target, and source and target degree for undirected graphs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1873) Distributed matrix implementation

2016-05-06 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273867#comment-15273867
 ] 

Fabian Hueske commented on FLINK-1873:
--

[~chobeat], I gave you contributor permissions for JIRA as well. You can now 
assign issues to yourself. 

> Distributed matrix implementation
> -
>
> Key: FLINK-1873
> URL: https://issues.apache.org/jira/browse/FLINK-1873
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: liaoyuxi
>Assignee: Simone Robutti
>  Labels: ML
>
> It would help to implement machine learning algorithms more quickly and 
> concisely if Flink provided support for storing data and computation in 
> distributed matrices. The design of the implementation is attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-1873) Distributed matrix implementation

2016-05-06 Thread Fabian Hueske (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Hueske updated FLINK-1873:
-
Assignee: Simone Robutti

> Distributed matrix implementation
> -
>
> Key: FLINK-1873
> URL: https://issues.apache.org/jira/browse/FLINK-1873
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: liaoyuxi
>Assignee: Simone Robutti
>  Labels: ML
>
> It would help to implement machine learning algorithms more quickly and 
> concisely if Flink provided support for storing data and computation in 
> distributed matrices. The design of the implementation is attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: FLINK-3650 Add maxBy/minBy to Scala DataSet AP...

2016-05-06 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1856#discussion_r62310212
  
--- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SelectByMinFunction.java ---
@@ -72,8 +73,8 @@ public T reduce(T value1, T value2) throws Exception {
 		for (int position : fields) {
 			// Save position of compared key
 			// Get both values - both implement comparable
-			Comparable comparable1 = value1.getFieldNotNull(position);
-			Comparable comparable2 = value2.getFieldNotNull(position);
+			Comparable comparable1 = ((Tuple)value1).getFieldNotNull(position);
--- End diff --

The `CaseClassTypeInfo` is the type information which is created for Scala 
tuples, if I'm not mistaken. And all Scala tuples are of type `Product`. With 
that you should be able to adapt the `SelectByMax/MinFunction`.
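
A minimal sketch of what such a Scala-specific variant could look like, 
relying only on `scala.Product#productElement` as suggested; the class 
name and constructor are illustrative, not existing Flink API:
{code}
import org.apache.flink.api.common.functions.ReduceFunction;
import scala.Product;

public class SelectByMinProductFunction<T extends Product> implements ReduceFunction<T> {

	private final int[] fields;

	public SelectByMinProductFunction(int... fields) {
		this.fields = fields;
	}

	@Override
	@SuppressWarnings({"unchecked", "rawtypes"})
	public T reduce(T value1, T value2) {
		for (int position : fields) {
			// Scala tuples implement scala.Product, so key fields are
			// accessible by index without casting to the Java Tuple type.
			Comparable comparable1 = (Comparable) value1.productElement(position);
			Comparable comparable2 = (Comparable) value2.productElement(position);

			int cmp = comparable1.compareTo(comparable2);
			if (cmp < 0) {
				return value1;
			} else if (cmp > 0) {
				return value2;
			}
			// Equal on this field: fall through to the next key field.
		}
		return value1;
	}
}
{code}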


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-3772] [gelly] Graph algorithms for vert...

2016-05-06 Thread vasia
Github user vasia commented on a diff in the pull request:

https://github.com/apache/flink/pull/1901#discussion_r62310380
  
--- Diff: docs/apis/batch/libs/gelly.md ---
@@ -2067,7 +2067,92 @@ configuration.
 
   
 
-  TranslateGraphIds
+  degree.annotate.directed.VertexInDegree
+  
+Annotate vertices of a directed graph with the in-degree count.
--- End diff --

What do you mean by in-degree "count"? What's the result of this operation?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: FLINK-3650 Add maxBy/minBy to Scala DataSet AP...

2016-05-06 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1856#discussion_r62309620
  
--- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SelectByMinFunction.java ---
@@ -72,8 +73,8 @@ public T reduce(T value1, T value2) throws Exception {
 		for (int position : fields) {
 			// Save position of compared key
 			// Get both values - both implement comparable
-			Comparable comparable1 = value1.getFieldNotNull(position);
-			Comparable comparable2 = value2.getFieldNotNull(position);
+			Comparable comparable1 = ((Tuple)value1).getFieldNotNull(position);
--- End diff --

I think the purpose of FLINK-3650 is adding support for Scala tuples. Thus, 
I would rather drop support for Java tuples in the Scala API than for Scala 
tuples. I would assume that you have to implement a Scala-specific 
`SelectByMax/MinFunction` to support Scala tuples.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-1873) Distributed matrix implementation

2016-05-06 Thread Simone Robutti (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273854#comment-15273854
 ] 

Simone Robutti commented on FLINK-1873:
---

Sure. I will have to study a bit to do it properly, but I was already going to 
do something like that for an algorithm I'm implementing (MinHash).

> Distributed matrix implementation
> -
>
> Key: FLINK-1873
> URL: https://issues.apache.org/jira/browse/FLINK-1873
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: liaoyuxi
>Assignee: liaoyuxi
>  Labels: ML
>
> It would help to implement machine learning algorithms more quickly and 
> concisely if Flink provided support for storing data and computation in 
> distributed matrices. The design of the implementation is attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3772) Graph algorithms for vertex and edge degree

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273860#comment-15273860
 ] 

ASF GitHub Bot commented on FLINK-3772:
---

Github user vasia commented on a diff in the pull request:

https://github.com/apache/flink/pull/1901#discussion_r62310380
  
--- Diff: docs/apis/batch/libs/gelly.md ---
@@ -2067,7 +2067,92 @@ configuration.
 
   
 
-  TranslateGraphIds
+  degree.annotate.directed.VertexInDegree
+  
+Annotate vertices of a directed graph with the in-degree count.
--- End diff --

What do you mean by in-degree "count"? What's the result of this operation?


> Graph algorithms for vertex and edge degree
> ---
>
> Key: FLINK-3772
> URL: https://issues.apache.org/jira/browse/FLINK-3772
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Many graph algorithms require vertices or edges to be marked with the degree. 
> This ticket provides algorithms for annotating
> * vertex degree for undirected graphs
> * vertex out-, in-, and out- and in-degree for directed graphs
> * edge source, target, and source and target degree for undirected graphs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3772) Graph algorithms for vertex and edge degree

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273863#comment-15273863
 ] 

ASF GitHub Bot commented on FLINK-3772:
---

Github user vasia commented on a diff in the pull request:

https://github.com/apache/flink/pull/1901#discussion_r62310412
  
--- Diff: docs/apis/batch/libs/gelly.md ---
@@ -2067,7 +2067,92 @@ configuration.
 
   
 
-  TranslateGraphIds
+  degree.annotate.directed.VertexInDegree
+  
+Annotate vertices of a directed graph with the in-degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> inDegree = graph
+  .run(new VertexInDegree<K, VV, EV>()
+    .setIncludeZeroDegreeVertices(true));
+{% endhighlight %}
+  
+
+
+
+  degree.annotate.directed.VertexOutDegree
+  
+Annotate vertices of a directed graph with the out-degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> outDegree = graph
+  .run(new VertexOutDegree<K, VV, EV>()
+    .setIncludeZeroDegreeVertices(true));
+{% endhighlight %}
+  
+
+
+
+  degree.annotate.directed.VertexDegreePair
+  
+Annotate vertices of a directed graph with both the out-degree and in-degree count.
+{% highlight java %}
+DataSet<Vertex<K, Tuple2<LongValue, LongValue>>> pairDegree = graph
+  .run(new VertexDegreePair<K, VV, EV>()
+    .setIncludeZeroDegreeVertices(true));
+{% endhighlight %}
+  
+
+
+
+  degree.annotate.undirected.VertexDegree
+  
+Annotate vertices of an undirected graph with the degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> degree = graph
+  .run(new VertexDegree<K, VV, EV>()
+    .setIncludeZeroDegreeVertices(true)
+    .setReduceOnTargetId(true));
+{% endhighlight %}
+  
+
+
+
+  degree.annotate.undirected.EdgeSourceDegree
+  
+Annotate edges of an undirected graph with degree of the source ID.
--- End diff --

When you say "undirected" graph, do you mean the algorithm does not take 
into account the edge direction or does the input graph have to be undirected? 
Since gelly graphs are in fact always directed and we simply add 
opposite-direction edges to represent undirected graphs, we should be clear 
about this. Also, which is the source vertex in an undirected edge?
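
For context, the opposite-direction-edges representation corresponds to 
Gelly's `Graph#getUndirected()`; a minimal sketch:
{code}
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.graph.Edge;
import org.apache.flink.graph.Graph;
import org.apache.flink.types.NullValue;

public class UndirectedSketch {
	public static void main(String[] args) throws Exception {
		ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

		Graph<Long, NullValue, NullValue> directed = Graph.fromDataSet(
			env.fromElements(new Edge<>(1L, 2L, NullValue.getInstance())),
			env);

		// Adds the reversed copy of every edge: the result contains
		// both (1 -> 2) and (2 -> 1).
		Graph<Long, NullValue, NullValue> undirected = directed.getUndirected();

		undirected.getEdges().print();
	}
}
{code}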


> Graph algorithms for vertex and edge degree
> ---
>
> Key: FLINK-3772
> URL: https://issues.apache.org/jira/browse/FLINK-3772
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Many graph algorithms require vertices or edges to be marked with the degree. 
> This ticket provides algorithms for annotating
> * vertex degree for undirected graphs
> * vertex out-, in-, and out- and in-degree for directed graphs
> * edge source, target, and source and target degree for undirected graphs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-3772] [gelly] Graph algorithms for vert...

2016-05-06 Thread vasia
Github user vasia commented on a diff in the pull request:

https://github.com/apache/flink/pull/1901#discussion_r62310412
  
--- Diff: docs/apis/batch/libs/gelly.md ---
@@ -2067,7 +2067,92 @@ configuration.
 
   
 
-  TranslateGraphIds
+  degree.annotate.directed.VertexInDegree
+  
+Annotate vertices of a directed graph with the in-degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> inDegree = graph
+  .run(new VertexInDegree<K, VV, EV>()
+    .setIncludeZeroDegreeVertices(true));
+{% endhighlight %}
+  
+
+
+
+  degree.annotate.directed.VertexOutDegree
+  
+Annotate vertices of a directed graph with the out-degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> outDegree = graph
+  .run(new VertexOutDegree<K, VV, EV>()
+    .setIncludeZeroDegreeVertices(true));
+{% endhighlight %}
+  
+
+
+
+  degree.annotate.directed.VertexDegreePair
+  
+Annotate vertices of a directed graph with both the out-degree and in-degree count.
+{% highlight java %}
+DataSet<Vertex<K, Tuple2<LongValue, LongValue>>> pairDegree = graph
+  .run(new VertexDegreePair<K, VV, EV>()
+    .setIncludeZeroDegreeVertices(true));
+{% endhighlight %}
+  
+
+
+
+  degree.annotate.undirected.VertexDegree
+  
+Annotate vertices of an undirected graph with the degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> degree = graph
+  .run(new VertexDegree<K, VV, EV>()
+    .setIncludeZeroDegreeVertices(true)
+    .setReduceOnTargetId(true));
+{% endhighlight %}
+  
+
+
+
+  degree.annotate.undirected.EdgeSourceDegree
+  
+Annotate edges of an undirected graph with degree of the source ID.
--- End diff --

When you say "undirected" graph, do you mean the algorithm does not take 
into account the edge direction or does the input graph have to be undirected? 
Since gelly graphs are in fact always directed and we simply add 
opposite-direction edges to represent undirected graphs, we should be clear 
about this. Also, which is the source vertex in an undirected edge?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-3772) Graph algorithms for vertex and edge degree

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273862#comment-15273862
 ] 

ASF GitHub Bot commented on FLINK-3772:
---

Github user vasia commented on a diff in the pull request:

https://github.com/apache/flink/pull/1901#discussion_r62310388
  
--- Diff: docs/apis/batch/libs/gelly.md ---
@@ -2067,7 +2067,92 @@ configuration.
 
   
 
-  TranslateGraphIds
+  degree.annotate.directed.VertexInDegree
+  
+Annotate vertices of a directed graph with the in-degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> inDegree = graph
+  .run(new VertexInDegree<K, VV, EV>()
+    .setIncludeZeroDegreeVertices(true));
--- End diff --

What does `setIncludeZeroDegreeVertices` do and what other options are 
available? Could you please document all of them (for the rest of the 
algorithms as well)?
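
For reference, a sketch of how the flag reads from the PR; the semantics 
are inferred from the method name (so the documentation should confirm 
them), and a `Graph<K, VV, EV> graph` is assumed in scope, mirroring the 
doc snippet above:
{code}
// Inferred: true  -> every vertex appears, isolated ones with degree 0;
//           false -> only vertices that occur in the edge list are emitted.
DataSet<Vertex<K, LongValue>> allVertices = graph
  .run(new VertexInDegree<K, VV, EV>()
    .setIncludeZeroDegreeVertices(true));

DataSet<Vertex<K, LongValue>> onlyEdgeTargets = graph
  .run(new VertexInDegree<K, VV, EV>()
    .setIncludeZeroDegreeVertices(false));
{code}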


> Graph algorithms for vertex and edge degree
> ---
>
> Key: FLINK-3772
> URL: https://issues.apache.org/jira/browse/FLINK-3772
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> Many graph algorithms require vertices or edges to be marked with the degree. 
> This ticket provides algorithms for annotating
> * vertex degree for undirected graphs
> * vertex out-, in-, and out- and in-degree for directed graphs
> * edge source, target, and source and target degree for undirected graphs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-3772] [gelly] Graph algorithms for vert...

2016-05-06 Thread vasia
Github user vasia commented on a diff in the pull request:

https://github.com/apache/flink/pull/1901#discussion_r62310388
  
--- Diff: docs/apis/batch/libs/gelly.md ---
@@ -2067,7 +2067,92 @@ configuration.
 
   
 
-  TranslateGraphIds
+  degree.annotate.directed.VertexInDegree
+  
+Annotate vertices of a directed graph with the in-degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> inDegree = graph
+  .run(new VertexInDegree<K, VV, EV>()
+    .setIncludeZeroDegreeVertices(true));
--- End diff --

What does `setIncludeZeroDegreeVertices` do and what other options are 
available? Could you please document all of them (for the rest of the 
algorithms as well)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-3650) Add maxBy/minBy to Scala DataSet API

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273858#comment-15273858
 ] 

ASF GitHub Bot commented on FLINK-3650:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1856#discussion_r62310212
  
--- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SelectByMinFunction.java ---
@@ -72,8 +73,8 @@ public T reduce(T value1, T value2) throws Exception {
 		for (int position : fields) {
 			// Save position of compared key
 			// Get both values - both implement comparable
-			Comparable comparable1 = value1.getFieldNotNull(position);
-			Comparable comparable2 = value2.getFieldNotNull(position);
+			Comparable comparable1 = ((Tuple)value1).getFieldNotNull(position);
--- End diff --

The `CaseClassTypeInfo` is the type information which is created for Scala 
tuples, if I'm not mistaken. And all Scala tuples are of type `Product`. With 
that you should be able to adapt the `SelectByMax/MinFunction`.


> Add maxBy/minBy to Scala DataSet API
> 
>
> Key: FLINK-3650
> URL: https://issues.apache.org/jira/browse/FLINK-3650
> Project: Flink
>  Issue Type: Improvement
>  Components: Java API, Scala API
>Affects Versions: 1.1.0
>Reporter: Till Rohrmann
>Assignee: ramkrishna.s.vasudevan
>
> The stable Java DataSet API contains the API calls {{maxBy}} and {{minBy}}. 
> These methods are not supported by the Scala DataSet API. These methods 
> should be added in order to have a consistent API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3650) Add maxBy/minBy to Scala DataSet API

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273852#comment-15273852
 ] 

ASF GitHub Bot commented on FLINK-3650:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1856#discussion_r62309620
  
--- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SelectByMinFunction.java ---
@@ -72,8 +73,8 @@ public T reduce(T value1, T value2) throws Exception {
 		for (int position : fields) {
 			// Save position of compared key
 			// Get both values - both implement comparable
-			Comparable comparable1 = value1.getFieldNotNull(position);
-			Comparable comparable2 = value2.getFieldNotNull(position);
+			Comparable comparable1 = ((Tuple)value1).getFieldNotNull(position);
--- End diff --

I think the purpose of FLINK-3650 is adding support for Scala tuples. Thus, 
I would rather drop support for Java tuples in the Scala API than for Scala 
tuples. I would assume that you have to implement a Scala-specific 
`SelectByMax/MinFunction` to support Scala tuples.


> Add maxBy/minBy to Scala DataSet API
> 
>
> Key: FLINK-3650
> URL: https://issues.apache.org/jira/browse/FLINK-3650
> Project: Flink
>  Issue Type: Improvement
>  Components: Java API, Scala API
>Affects Versions: 1.1.0
>Reporter: Till Rohrmann
>Assignee: ramkrishna.s.vasudevan
>
> The stable Java DataSet API contains the API calls {{maxBy}} and {{minBy}}. 
> These methods are not supported by the Scala DataSet API. These methods 
> should be added in order to have a consistent API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1873) Distributed matrix implementation

2016-05-06 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273826#comment-15273826
 ] 

Fabian Hueske commented on FLINK-1873:
--

I agree, looks like the issue is abandoned. 
[~chobeat], if you want to work on this issue, I can assign it to you.

> Distributed matrix implementation
> -
>
> Key: FLINK-1873
> URL: https://issues.apache.org/jira/browse/FLINK-1873
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: liaoyuxi
>Assignee: liaoyuxi
>  Labels: ML
>
> It would help to implement machine learning algorithms more quickly and 
> concisely if Flink provided support for storing data and computation in 
> distributed matrices. The design of the implementation is attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-1502] [WIP] Basic Metric System

2016-05-06 Thread rmetzger
Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/1947#issuecomment-217389510
  
Okay. I think we need to support multiple instances of the same job on a 
TaskManager.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273812#comment-15273812
 ] 

ASF GitHub Bot commented on FLINK-1502:
---

Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/1947#issuecomment-217389510
  
Okay. I think we need to support multiple instances of the same job on a 
TaskManager.


> Expose metrics to graphite, ganglia and JMX.
> 
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager, TaskManager
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Assignee: Chesnay Schepler
>Priority: Minor
> Fix For: pre-apache
>
>
> The metrics library allows to expose collected metrics easily to other 
> systems such as graphite, ganglia or Java's JVM (VisualVM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-3857][Streaming Connectors]Add reconnec...

2016-05-06 Thread rmetzger
Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/1962#issuecomment-217373539
  
How did you test the code you've implemented in this pull request?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-3857) Add reconnect attempt to Elasticsearch host

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273771#comment-15273771
 ] 

ASF GitHub Bot commented on FLINK-3857:
---

Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/1962#issuecomment-217373539
  
How did you test the code you've implemented in this pull request?


> Add reconnect attempt to Elasticsearch host
> ---
>
> Key: FLINK-3857
> URL: https://issues.apache.org/jira/browse/FLINK-3857
> Project: Flink
>  Issue Type: Improvement
>  Components: Streaming Connectors
>Affects Versions: 1.1.0, 1.0.2
>Reporter: Fabian Hueske
>Assignee: Subhobrata Dey
>
> Currently, the connection to the Elasticsearch host is opened in 
> {{ElasticsearchSink.open()}}. In case the connection is lost (maybe due to a 
> changed DNS entry), the sink fails.
> I propose to catch the Exception for lost connections in the {{invoke()}} 
> method and try to re-open the connection for a configurable number of times 
> with a certain delay.
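
A minimal sketch of the proposed invoke()-level retry loop; maxRetries, 
retryDelayMillis, send() and reconnect() are illustrative names, not the 
existing ElasticsearchSink API:
{code}
public class RetryingSinkSketch<T> {

	private final int maxRetries = 3;            // configurable number of attempts
	private final long retryDelayMillis = 1000L; // configurable delay

	public void invoke(T element) throws Exception {
		for (int attempt = 0; ; attempt++) {
			try {
				send(element);                 // forward to the Elasticsearch client
				return;
			} catch (RuntimeException e) {     // e.g. a lost-connection exception
				if (attempt >= maxRetries) {
					throw e;                   // give up after the configured attempts
				}
				Thread.sleep(retryDelayMillis);
				reconnect();                   // re-open the client connection
			}
		}
	}

	private void send(T element) { /* elided */ }

	private void reconnect() { /* elided */ }
}
{code}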



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1873) Distributed matrix implementation

2016-05-06 Thread Simone Robutti (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273770#comment-15273770
 ] 

Simone Robutti commented on FLINK-1873:
---

This issue has been dead for one year. What about reassigning it? I think it 
would help implement many algorithms; right now people need to implement 
their own distributed operations on matrices every time.

> Distributed matrix implementation
> -
>
> Key: FLINK-1873
> URL: https://issues.apache.org/jira/browse/FLINK-1873
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: liaoyuxi
>Assignee: liaoyuxi
>  Labels: ML
>
> It would help to implement machine learning algorithms more quickly and 
> concisely if Flink provided support for storing data and computation in 
> distributed matrices. The design of the implementation is attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-1502] [WIP] Basic Metric System

2016-05-06 Thread zentol
Github user zentol commented on the pull request:

https://github.com/apache/flink/pull/1947#issuecomment-217368732
  
@rmetzger you are correct that your job failed since the previous one 
wasn't cleaned up yet. Should you try to run 2 identical jobs in parallel, it 
will fail, since the 2 jobs would use the same metrics due to name clashes. Note 
that in this version this also occurs when 2 operators have the same name. I 
have some additional functionality coming up that would allow you to circumvent 
this issue.

@ankitcha The problem with multiple reporters is our configuration: it only 
supports single-line key-value pairs, and you need to know the exact key to 
access it. In order to configure multiple reporters you would either need a 
nested structure (which is not supported), or index the configuration keys 
(metrics.reporter.1.class) and add a new parameter containing the indices to 
use (e.g. metrics.reporter: 0, 1), which isn't particularly user-friendly. The 
metric system itself could deal with multiple reporters with minor 
modifications; it's all about the configuration.
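
To illustrate the indexed-key scheme described above (hypothetical keys 
and class names; this format was only being proposed in this thread):
{code}
# hypothetical flink-conf.yaml entries for two reporters
metrics.reporter: 0, 1
metrics.reporter.0.class: org.apache.flink.metrics.reporter.JMXReporter
metrics.reporter.1.class: org.apache.flink.metrics.reporter.GraphiteReporter
metrics.reporter.1.host: graphite.example.org
metrics.reporter.1.port: 2003
{code}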


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273740#comment-15273740
 ] 

ASF GitHub Bot commented on FLINK-1502:
---

Github user zentol commented on the pull request:

https://github.com/apache/flink/pull/1947#issuecomment-217368732
  
@rmetzger you are correct that your job failed since the previous one 
wasn't cleaned up yet. Should you try to run 2 identical jobs in parallel, it 
will fail, since the 2 jobs would use the same metrics due to name clashes. Note 
that in this version this also occurs when 2 operators have the same name. I 
have some additional functionality coming up that would allow you to circumvent 
this issue.

@ankitcha The problem with multiple reporters is our configuration: it only 
supports single-line key-value pairs, and you need to know the exact key to 
access it. In order to configure multiple reporters you would either need a 
nested structure (which is not supported), or index the configuration keys 
(metrics.reporter.1.class) and add a new parameter containing the indices to 
use (e.g. metrics.reporter: 0, 1), which isn't particularly user-friendly. The 
metric system itself could deal with multiple reporters with minor 
modifications; it's all about the configuration.


> Expose metrics to graphite, ganglia and JMX.
> 
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager, TaskManager
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Assignee: Chesnay Schepler
>Priority: Minor
> Fix For: pre-apache
>
>
> The metrics library allows to expose collected metrics easily to other 
> systems such as graphite, ganglia or Java's JVM (VisualVM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3650) Add maxBy/minBy to Scala DataSet API

2016-05-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273738#comment-15273738
 ] 

ASF GitHub Bot commented on FLINK-3650:
---

Github user ramkrish86 commented on a diff in the pull request:

https://github.com/apache/flink/pull/1856#discussion_r62296025
  
--- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SelectByMinFunction.java ---
@@ -72,8 +73,8 @@ public T reduce(T value1, T value2) throws Exception {
 		for (int position : fields) {
 			// Save position of compared key
 			// Get both values - both implement comparable
-			Comparable comparable1 = value1.getFieldNotNull(position);
-			Comparable comparable2 = value2.getFieldNotNull(position);
+			Comparable comparable1 = ((Tuple)value1).getFieldNotNull(position);
--- End diff --

@tillrohrmann 
Any suggestions here? Should we handle the Scala Tuple in another JIRA? 
I am not an expert in Scala; I have just started working with it. So if it can 
be handled in another JIRA, this PR can be integrated.


> Add maxBy/minBy to Scala DataSet API
> 
>
> Key: FLINK-3650
> URL: https://issues.apache.org/jira/browse/FLINK-3650
> Project: Flink
>  Issue Type: Improvement
>  Components: Java API, Scala API
>Affects Versions: 1.1.0
>Reporter: Till Rohrmann
>Assignee: ramkrishna.s.vasudevan
>
> The stable Java DataSet API contains the API calls {{maxBy}} and {{minBy}}. 
> These methods are not supported by the Scala DataSet API. These methods 
> should be added in order to have a consistent API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: FLINK-3650 Add maxBy/minBy to Scala DataSet AP...

2016-05-06 Thread ramkrish86
Github user ramkrish86 commented on a diff in the pull request:

https://github.com/apache/flink/pull/1856#discussion_r62296025
  
--- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SelectByMinFunction.java ---
@@ -72,8 +73,8 @@ public T reduce(T value1, T value2) throws Exception {
 		for (int position : fields) {
 			// Save position of compared key
 			// Get both values - both implement comparable
-			Comparable comparable1 = value1.getFieldNotNull(position);
-			Comparable comparable2 = value2.getFieldNotNull(position);
+			Comparable comparable1 = ((Tuple)value1).getFieldNotNull(position);
--- End diff --

@tillrohrmann 
Any suggestions here? Should we handle the Scala Tuple in another JIRA? 
I am not an expert in Scala; I have just started working with it. So if it can 
be handled in another JIRA, this PR can be integrated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---