[jira] [Updated] (FLINK-3801) Upgrade Joda-Time library to 2.9.3
[ https://issues.apache.org/jira/browse/FLINK-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated FLINK-3801: -- Description: Currently Joda-Time 2.5 is used, which is very old. We should upgrade to 2.9.3 was: Currently Joda-Time 2.5 is used, which is very old. We should upgrade to 2.9.3 > Upgrade Joda-Time library to 2.9.3 > -- > > Key: FLINK-3801 > URL: https://issues.apache.org/jira/browse/FLINK-3801 > Project: Flink > Issue Type: Improvement >Reporter: Ted Yu >Priority: Minor > > Currently Joda-Time 2.5 is used, which is very old. > We should upgrade to 2.9.3 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
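A version bump like this is a one-line change in the relevant pom.xml. The snippet below shows the target coordinates only; which Flink module's pom declares the dependency (or whether it is managed via a version property) is not stated in the ticket:

```xml
<dependency>
  <groupId>joda-time</groupId>
  <artifactId>joda-time</artifactId>
  <version>2.9.3</version>
</dependency>
```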
[jira] [Updated] (FLINK-3753) KillerWatchDog should not use kill on toKill thread
[ https://issues.apache.org/jira/browse/FLINK-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated FLINK-3753: -- Description: {code} // this is harsh, but this watchdog is a last resort if (toKill.isAlive()) { toKill.stop(); } {code} stop() is deprecated. See: https://www.securecoding.cert.org/confluence/display/java/THI05-J.+Do+not+use+Thread.stop()+to+terminate+threads was: {code} // this is harsh, but this watchdog is a last resort if (toKill.isAlive()) { toKill.stop(); } {code} stop() is deprecated. See: https://www.securecoding.cert.org/confluence/display/java/THI05-J.+Do+not+use+Thread.stop()+to+terminate+threads > KillerWatchDog should not use kill on toKill thread > --- > > Key: FLINK-3753 > URL: https://issues.apache.org/jira/browse/FLINK-3753 > Project: Flink > Issue Type: Bug >Reporter: Ted Yu >Priority: Minor > > {code} > // this is harsh, but this watchdog is a last resort > if (toKill.isAlive()) { > toKill.stop(); > } > {code} > stop() is deprecated. > See: > https://www.securecoding.cert.org/confluence/display/java/THI05-J.+Do+not+use+Thread.stop()+to+terminate+threads
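The CERT rule linked above recommends cooperative cancellation instead of Thread.stop(). A minimal sketch of that pattern follows; the class and method names are illustrative, not the actual KillerWatchDog code. The idea: interrupt the thread, then join with a timeout so the watchdog still gives up after a bounded wait.

```java
// Sketch of cooperative cancellation, the usual replacement for the
// deprecated Thread.stop(): interrupt the worker and let it exit on its own.
public class CooperativeShutdown {

    public static Thread startWorker() {
        Thread worker = new Thread(() -> {
            // The worker polls its interrupt flag instead of being killed.
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Thread.sleep(10); // stand-in for a unit of real work
                } catch (InterruptedException e) {
                    // Restore the flag so the loop condition sees it.
                    Thread.currentThread().interrupt();
                }
            }
        });
        worker.start();
        return worker;
    }

    public static boolean shutdown(Thread worker, long timeoutMillis) throws InterruptedException {
        worker.interrupt();          // request termination
        worker.join(timeoutMillis);  // wait, but only up to the timeout
        return !worker.isAlive();    // report whether it actually stopped
    }
}
```

If the thread never checks its interrupt flag, join() times out and isAlive() reports the failure, which the caller can then escalate by other means.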
[jira] [Commented] (FLINK-3772) Graph algorithms for vertex and edge degree
[ https://issues.apache.org/jira/browse/FLINK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274657#comment-15274657 ] ASF GitHub Bot commented on FLINK-3772: --- Github user greghogan commented on a diff in the pull request: https://github.com/apache/flink/pull/1901#discussion_r62384029 --- Diff: docs/apis/batch/libs/gelly.md --- @@ -2067,7 +2067,92 @@ configuration. - TranslateGraphIds + degree.annotate.directed.VertexInDegree + +Annotate vertices of a directed graph with the in-degree count. --- End diff -- I am removing "count" from the descriptions since "degree" implies a count. > Graph algorithms for vertex and edge degree > --- > > Key: FLINK-3772 > URL: https://issues.apache.org/jira/browse/FLINK-3772 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Many graph algorithms require vertices or edges to be marked with the degree. > This ticket provides algorithms for annotating > * vertex degree for undirected graphs > * vertex out-, in-, and out- and in-degree for directed graphs > * edge source, target, and source and target degree for undirected graphs
[jira] [Commented] (FLINK-3857) Add reconnect attempt to Elasticsearch host
[ https://issues.apache.org/jira/browse/FLINK-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274627#comment-15274627 ] ASF GitHub Bot commented on FLINK-3857: --- Github user sbcd90 commented on the pull request: https://github.com/apache/flink/pull/1962#issuecomment-217540405 Hello @rmetzger , I added a testcase now to the `ElasticsearchSinkITCase.java` list of tests. Can you kindly have a look once? > Add reconnect attempt to Elasticsearch host > --- > > Key: FLINK-3857 > URL: https://issues.apache.org/jira/browse/FLINK-3857 > Project: Flink > Issue Type: Improvement > Components: Streaming Connectors >Affects Versions: 1.1.0, 1.0.2 >Reporter: Fabian Hueske >Assignee: Subhobrata Dey > > Currently, the connection to the Elasticsearch host is opened in > {{ElasticsearchSink.open()}}. In case the connection is lost (maybe due to a > changed DNS entry), the sink fails. > I propose to catch the Exception for lost connections in the {{invoke()}} > method and try to re-open the connection for a configurable number of times > with a certain delay.
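The retry behaviour proposed in the description can be sketched independently of Elasticsearch. Everything below (the RetryingInvoker class and its Action interface) is a hypothetical illustration of "retry a configurable number of times with a certain delay", not Flink connector code:

```java
// Hedged sketch of bounded retry with a fixed delay between attempts.
public class RetryingInvoker {

    /** Placeholder for the failing operation, e.g. re-opening a connection and sending. */
    public interface Action { void run() throws Exception; }

    private final int maxRetries;
    private final long delayMillis;

    public RetryingInvoker(int maxRetries, long delayMillis) {
        this.maxRetries = maxRetries;
        this.delayMillis = delayMillis;
    }

    public void invokeWithRetry(Action action) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                action.run();
                return; // success
            } catch (Exception e) {
                if (attempt >= maxRetries) {
                    throw e; // attempts exhausted, surface the last failure
                }
                Thread.sleep(delayMillis); // back off before retrying
            }
        }
    }
}
```

In the sink scenario the Action body would close the stale connection and open a fresh one before re-sending, so a changed DNS entry is picked up on the next attempt.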
[jira] [Commented] (FLINK-2044) Implementation of Gelly HITS Algorithm
[ https://issues.apache.org/jira/browse/FLINK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274561#comment-15274561 ] ASF GitHub Bot commented on FLINK-2044: --- Github user gallenvara commented on the pull request: https://github.com/apache/flink/pull/1956#issuecomment-217527593 Hi, @vasia @greghogan , - code modified; it now supports returning both the hub and authority scores as a `Tuple2`. The implementation runs one extra iteration at the end to normalize the hub values. With the scatter-gather version implemented, the GSA version is easy to write. - As for the threshold, the algorithm would have to check between neighboring hub/authority iterations whether any vertex was updated. That is a little difficult and, in my view, the `maxIteration` parameter plays a similar role to a threshold. - I will open another issue for the GSA version and relevant tests. > Implementation of Gelly HITS Algorithm > -- > > Key: FLINK-2044 > URL: https://issues.apache.org/jira/browse/FLINK-2044 > Project: Flink > Issue Type: New Feature > Components: Gelly >Reporter: Ahamd Javid >Assignee: GaoLun >Priority: Minor > > Implementation of the HITS algorithm in the Gelly API using Java. The feature branch > can be found here: (https://github.com/JavidMayar/flink/commits/HITS)
[jira] [Assigned] (FLINK-3618) Rename abstract UDF classes in Scatter-Gather implementation
[ https://issues.apache.org/jira/browse/FLINK-3618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Hogan reassigned FLINK-3618: - Assignee: Greg Hogan > Rename abstract UDF classes in Scatter-Gather implementation > > > Key: FLINK-3618 > URL: https://issues.apache.org/jira/browse/FLINK-3618 > Project: Flink > Issue Type: Improvement > Components: Gelly >Affects Versions: 1.1.0, 1.0.1 >Reporter: Martin Junghanns >Assignee: Greg Hogan >Priority: Minor > > We now offer three vertex-centric computing abstractions: > * Pregel > * Gather-Sum-Apply > * Scatter-Gather > Each of these abstractions provides abstract classes that need to be > implemented by the user: > * Pregel: {{ComputeFunction}} > * GSA: {{GatherFunction}}, {{SumFunction}}, {{ApplyFunction}} > * Scatter-Gather: {{MessagingFunction}}, {{VertexUpdateFunction}} > In Pregel and GSA, the names of those functions follow the name of the > abstraction or the name suggested in the corresponding papers. For > consistency of the API, I propose to rename {{MessagingFunction}} to > {{ScatterFunction}} and {{VertexUpdateFunction}} to {{GatherFunction}}. > Also for consistency, I would like to change the parameter order in > {{Graph.runScatterGatherIteration(VertexUpdateFunction f1, MessagingFunction > f2)}} to {{Graph.runScatterGatherIteration(ScatterFunction f1, GatherFunction > f2)}} (like in {{Graph.runGatherSumApplyFunction(...)}})
[jira] [Commented] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.
[ https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274400#comment-15274400 ] ASF GitHub Bot commented on FLINK-1502: --- Github user ankitcha commented on the pull request: https://github.com/apache/flink/pull/1947#issuecomment-217507497 @rmetzger thanks for the response. I think I got what you explained, and I agree that using an index in configuration keys won't be a nice experience. But can we support a nested structure in the Flink conf? I am unsure about the scope of this change, so maybe it's a bad suggestion. But this is something that would really help me put our application into production, since we have to use multiple reporters as part of our infrastructure requirements. > Expose metrics to graphite, ganglia and JMX. > > > Key: FLINK-1502 > URL: https://issues.apache.org/jira/browse/FLINK-1502 > Project: Flink > Issue Type: Sub-task > Components: JobManager, TaskManager >Affects Versions: 0.9 >Reporter: Robert Metzger >Assignee: Chesnay Schepler >Priority: Minor > Fix For: pre-apache > > > The metrics library allows to expose collected metrics easily to other > systems such as graphite, ganglia or Java's JVM (VisualVM).
[jira] [Commented] (FLINK-3877) Create TranslateFunction interface for Graph translators
[ https://issues.apache.org/jira/browse/FLINK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274331#comment-15274331 ] ASF GitHub Bot commented on FLINK-3877: --- GitHub user greghogan opened a pull request: https://github.com/apache/flink/pull/1968 [FLINK-3877] [gelly] Create TranslateFunction interface for Graph translators The TranslateFunction interface is similar to MapFunction but may be called multiple times before serialization. You can merge this pull request into a Git repository by running: $ git pull https://github.com/greghogan/flink 3877_create_translatefunction_interface_for_graph_translators Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/1968.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1968 commit b1d838a276c9c0f2e6b79b34f924cc42c93c371d Author: Greg Hogan Date: 2016-05-04T20:51:23Z [FLINK-3877] [gelly] Create TranslateFunction interface for Graph translators The TranslateFunction interface is similar to MapFunction but may be called multiple times before serialization. > Create TranslateFunction interface for Graph translators > > > Key: FLINK-3877 > URL: https://issues.apache.org/jira/browse/FLINK-3877 > Project: Flink > Issue Type: Bug > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan >Priority: Minor > Fix For: 1.1.0 > > > I now recall why FLINK-3771 had a {{Translator}} interface with a > {{translate}} method taking a field for reuse: when we translate edge ID the > translator must be called twice. > {{TranslateFunction}} will be modeled after {{MapFunction}} and > {{RichTranslateFunction}} will be modeled after {{RichMapFunction}}. > The unit test should have caught this but I was reusing values between fields > which did not detect that values were overwritten.
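Based only on the ticket text ("modeled after MapFunction", "a translate method taking a field for reuse"), the proposed interface might look like the following sketch; the exact generics, names, and modifiers in the merged Flink code may differ:

```java
import java.io.Serializable;

// Hedged sketch of the proposed interface: like MapFunction, but the caller
// passes a reuse object because the translator may be invoked multiple times
// (e.g. twice per edge, once for each endpoint ID) before serialization.
interface TranslateFunction<OLD, NEW> extends Serializable {
    NEW translate(OLD value, NEW reuse) throws Exception;
}
```

A translator from Long IDs to Strings is then a one-line lambda; an implementation with an immutable result type simply ignores the reuse argument.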
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274307#comment-15274307 ] Greg Hogan commented on FLINK-3879: --- I think FLINK-2044 can test for convergence if the last three scores are stored for each vertex. That would also make it trivial to return both scores. So, yes, performance. > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm
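For intuition, the hub/authority update that both implementations iterate can be sketched on a single machine. This deliberately ignores Gelly's distributed execution and convergence testing; it only shows the per-round math: authority(v) sums the hub scores of v's in-neighbors, hub(v) sums the authority scores of v's out-neighbors, and each vector is normalized after every round.

```java
import java.util.Arrays;

// Minimal single-machine HITS sketch, for intuition only.
public class Hits {

    /** edges[i] = {source, target}; returns {hub scores, authority scores}. */
    public static double[][] run(int numVertices, int[][] edges, int iterations) {
        double[] hub = new double[numVertices];
        double[] auth = new double[numVertices];
        Arrays.fill(hub, 1.0); // start from uniform hub scores

        for (int i = 0; i < iterations; i++) {
            Arrays.fill(auth, 0.0);
            for (int[] e : edges) auth[e[1]] += hub[e[0]]; // authority from in-edges
            normalize(auth);

            Arrays.fill(hub, 0.0);
            for (int[] e : edges) hub[e[0]] += auth[e[1]]; // hub from out-edges
            normalize(hub);
        }
        return new double[][] {hub, auth};
    }

    private static void normalize(double[] v) {
        double norm = 0.0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        if (norm > 0) {
            for (int i = 0; i < v.length; i++) v[i] /= norm;
        }
    }
}
```

On the toy graph 0→2, 1→2, vertex 2 ends up as the sole authority and vertices 0 and 1 as equal hubs, matching the "points to / pointed to" description quoted in the ticket.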
[jira] [Commented] (FLINK-3880) Use ConcurrentHashMap for Accumulators
[ https://issues.apache.org/jira/browse/FLINK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274298#comment-15274298 ] Maximilian Michels commented on FLINK-3880: --- You're right, the synchronized map is a bottleneck. Actually, it is not even necessary that it synchronizes. In a regular Flink job, it can only be accessed by one task at a time. Only if the user spawns additional threads can it be concurrently modified. In that case the user would have to take care of the synchronization (and, if not, get a ConcurrentModificationException). So we can simply make it a normal map. > Use ConcurrentHashMap for Accumulators > -- > > Key: FLINK-3880 > URL: https://issues.apache.org/jira/browse/FLINK-3880 > Project: Flink > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Ken Krugler >Priority: Minor > > I was looking at improving DataSet performance - this is for a job created > using the Cascading-Flink planner for Cascading 3.1. > While doing a quick "poor man's profiler" session with one of the TaskManager > processes, I noticed that many (most?) of the threads that were actually > running were in this state: > {code:java} > "DataSource (/working1/terms) (8/20)" daemon prio=10 tid=0x7f55673e0800 > nid=0x666a runnable [0x7f556abcf000] >java.lang.Thread.State: RUNNABLE > at java.util.Collections$SynchronizedMap.get(Collections.java:2037) > - locked <0x0006e73fe718> (a java.util.Collections$SynchronizedMap) > at > org.apache.flink.api.common.functions.util.AbstractRuntimeUDFContext.getAccumulator(AbstractRuntimeUDFContext.java:162) > at > org.apache.flink.api.common.functions.util.AbstractRuntimeUDFContext.getLongCounter(AbstractRuntimeUDFContext.java:113) > at > com.dataartisans.flink.cascading.runtime.util.FlinkFlowProcess.getOrInitCounter(FlinkFlowProcess.java:245) > at > com.dataartisans.flink.cascading.runtime.util.FlinkFlowProcess.increment(FlinkFlowProcess.java:128) > at > com.dataartisans.flink.cascading.runtime.util.FlinkFlowProcess.increment(FlinkFlowProcess.java:122) > at > cascading.tap.hadoop.util.MeasuredRecordReader.next(MeasuredRecordReader.java:65) > at cascading.scheme.hadoop.SequenceFile.source(SequenceFile.java:97) > at > cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:166) > at > cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:139) > at > com.dataartisans.flink.cascading.runtime.source.TapSourceStage.readNextRecord(TapSourceStage.java:70) > at > com.dataartisans.flink.cascading.runtime.source.TapInputFormat.reachedEnd(TapInputFormat.java:175) > at > org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:173) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559) > at java.lang.Thread.run(Thread.java:745) > {code} > It looks like Cascading is asking Flink to increment a counter with each > Tuple read, and that in turn is often blocked on getting access to the > Accumulator object in a map. It looks like this is a SynchronizedMap, but > using a ConcurrentHashMap (for example) would reduce this contention.
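The contrast the ticket draws can be sketched with a small counter registry. LongAdder and the class name below are illustrative choices, not Flink's accumulator types; the point is that get() on a ConcurrentHashMap is non-blocking, so a hot read path like the one in the stack trace no longer serializes on a single map lock:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch: a counter map backed by ConcurrentHashMap instead of
// Collections.synchronizedMap, avoiding one global lock on every lookup.
public class CounterRegistry {

    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    public void increment(String name) {
        // computeIfAbsent locks at most one bin, and only on first insert;
        // LongAdder keeps increments cheap under contention.
        counters.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    public long get(String name) {
        LongAdder a = counters.get(name); // lock-free read
        return a == null ? 0L : a.sum();
    }
}
```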
[jira] [Assigned] (FLINK-3880) Use ConcurrentHashMap for Accumulators
[ https://issues.apache.org/jira/browse/FLINK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels reassigned FLINK-3880: - Assignee: Maximilian Michels > Use ConcurrentHashMap for Accumulators > -- > > Key: FLINK-3880 > URL: https://issues.apache.org/jira/browse/FLINK-3880 > Project: Flink > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Ken Krugler >Assignee: Maximilian Michels >Priority: Minor > > I was looking at improving DataSet performance - this is for a job created > using the Cascading-Flink planner for Cascading 3.1. > While doing a quick "poor man's profiler" session with one of the TaskManager > processes, I noticed that many (most?) of the threads that were actually > running were in this state: > {code:java} > "DataSource (/working1/terms) (8/20)" daemon prio=10 tid=0x7f55673e0800 > nid=0x666a runnable [0x7f556abcf000] >java.lang.Thread.State: RUNNABLE > at java.util.Collections$SynchronizedMap.get(Collections.java:2037) > - locked <0x0006e73fe718> (a java.util.Collections$SynchronizedMap) > at > org.apache.flink.api.common.functions.util.AbstractRuntimeUDFContext.getAccumulator(AbstractRuntimeUDFContext.java:162) > at > org.apache.flink.api.common.functions.util.AbstractRuntimeUDFContext.getLongCounter(AbstractRuntimeUDFContext.java:113) > at > com.dataartisans.flink.cascading.runtime.util.FlinkFlowProcess.getOrInitCounter(FlinkFlowProcess.java:245) > at > com.dataartisans.flink.cascading.runtime.util.FlinkFlowProcess.increment(FlinkFlowProcess.java:128) > at > com.dataartisans.flink.cascading.runtime.util.FlinkFlowProcess.increment(FlinkFlowProcess.java:122) > at > cascading.tap.hadoop.util.MeasuredRecordReader.next(MeasuredRecordReader.java:65) > at cascading.scheme.hadoop.SequenceFile.source(SequenceFile.java:97) > at > cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:166) > at > cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:139) > at > com.dataartisans.flink.cascading.runtime.source.TapSourceStage.readNextRecord(TapSourceStage.java:70) > at > com.dataartisans.flink.cascading.runtime.source.TapInputFormat.reachedEnd(TapInputFormat.java:175) > at > org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:173) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559) > at java.lang.Thread.run(Thread.java:745) > {code} > It looks like Cascading is asking Flink to increment a counter with each > Tuple read, and that in turn is often blocked on getting access to the > Accumulator object in a map. It looks like this is a SynchronizedMap, but > using a ConcurrentHashMap (for example) would reduce this contention.
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274270#comment-15274270 ] Vasia Kalavri commented on FLINK-3879: -- So, it provides the same functionality as FLINK-2044, but it's more efficient? > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm
[jira] [Commented] (FLINK-1873) Distributed matrix implementation
[ https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274265#comment-15274265 ] Simone Robutti commented on FLINK-1873: --- Umh ok. I think that on a block-partitioned matrix I will need to perform block-wise operations, so it makes sense to represent the blocks as Breeze matrices. Anyway, talking about Flink's implementations, why were they implemented in the first place if we must rely on Breeze for operations? > Distributed matrix implementation > - > > Key: FLINK-1873 > URL: https://issues.apache.org/jira/browse/FLINK-1873 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: liaoyuxi >Assignee: Simone Robutti > Labels: ML > > It would help to implement machine learning algorithms more quickly and > concisely if Flink provided support for storing data and computation in > distributed matrices. The design of the implementation is attached.
[jira] [Updated] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Hogan updated FLINK-3879: -- Description: Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is presented in [0] and described in [1]. "[HITS] is a very popular and effective algorithm to rank documents based on the link information among a set of documents. The algorithm presumes that a good hub is a document that points to many others, and a good authority is a document that many documents point to." [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] This implementation differs from FLINK-2044 by providing for convergence, outputting both hub and authority scores, and completing in half the number of iterations. [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf [1] https://en.wikipedia.org/wiki/HITS_algorithm was: Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is presented in [0] and described in [1]. [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf [1] https://en.wikipedia.org/wiki/HITS_algorithm > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > "[HITS] is a very popular and effective algorithm to rank documents based on > the link information among a set of documents. The algorithm presumes that a > good hub is a document that points to many others, and a good authority is a > document that many documents point to." > [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf] > This implementation differs from FLINK-2044 by providing for convergence, > outputting both hub and authority scores, and completing in half the number > of iterations. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274254#comment-15274254 ] Vasia Kalavri commented on FLINK-3879: -- Hi [~greghogan], Can you please extend the description a bit? How is this algorithm different from what FLINK-2044 proposes, and what's the motivation for adding it to Gelly? Thanks! > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm
[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm
[ https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274225#comment-15274225 ] ASF GitHub Bot commented on FLINK-3879: --- GitHub user greghogan opened a pull request: https://github.com/apache/flink/pull/1967 [FLINK-3879] [gelly] Native implementation of HITS algorithm You can merge this pull request into a Git repository by running: $ git pull https://github.com/greghogan/flink 3879_native_implementation_of_hits_algorithm Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/1967.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1967 commit 724a86b5b5e7e7a93392048d5842fe54df7c4bfe Author: Greg Hogan Date: 2016-05-06T15:13:26Z [FLINK-3879] [gelly] Native implementation of HITS algorithm > Native implementation of HITS algorithm > --- > > Key: FLINK-3879 > URL: https://issues.apache.org/jira/browse/FLINK-3879 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is > presented in [0] and described in [1]. > [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf > [1] https://en.wikipedia.org/wiki/HITS_algorithm
[jira] [Created] (FLINK-3880) Use ConcurrentHashMap for Accumulators
Ken Krugler created FLINK-3880: -- Summary: Use ConcurrentHashMap for Accumulators Key: FLINK-3880 URL: https://issues.apache.org/jira/browse/FLINK-3880 Project: Flink Issue Type: Improvement Affects Versions: 1.1.0 Reporter: Ken Krugler Priority: Minor I was looking at improving DataSet performance - this is for a job created using the Cascading-Flink planner for Cascading 3.1. While doing a quick "poor man's profiler" session with one of the TaskManager processes, I noticed that many (most?) of the threads that were actually running were in this state: {code:java} "DataSource (/working1/terms) (8/20)" daemon prio=10 tid=0x7f55673e0800 nid=0x666a runnable [0x7f556abcf000] java.lang.Thread.State: RUNNABLE at java.util.Collections$SynchronizedMap.get(Collections.java:2037) - locked <0x0006e73fe718> (a java.util.Collections$SynchronizedMap) at org.apache.flink.api.common.functions.util.AbstractRuntimeUDFContext.getAccumulator(AbstractRuntimeUDFContext.java:162) at org.apache.flink.api.common.functions.util.AbstractRuntimeUDFContext.getLongCounter(AbstractRuntimeUDFContext.java:113) at com.dataartisans.flink.cascading.runtime.util.FlinkFlowProcess.getOrInitCounter(FlinkFlowProcess.java:245) at com.dataartisans.flink.cascading.runtime.util.FlinkFlowProcess.increment(FlinkFlowProcess.java:128) at com.dataartisans.flink.cascading.runtime.util.FlinkFlowProcess.increment(FlinkFlowProcess.java:122) at cascading.tap.hadoop.util.MeasuredRecordReader.next(MeasuredRecordReader.java:65) at cascading.scheme.hadoop.SequenceFile.source(SequenceFile.java:97) at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:166) at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:139) at com.dataartisans.flink.cascading.runtime.source.TapSourceStage.readNextRecord(TapSourceStage.java:70) at com.dataartisans.flink.cascading.runtime.source.TapInputFormat.reachedEnd(TapInputFormat.java:175) at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:173) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559) at java.lang.Thread.run(Thread.java:745) {code} It looks like Cascading is asking Flink to increment a counter with each Tuple read, and that in turn is often blocked on getting access to the Accumulator object in a map. It looks like this is a SynchronizedMap, but using a ConcurrentHashMap (for example) would reduce this contention.
[jira] [Commented] (FLINK-1873) Distributed matrix implementation
[ https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274229#comment-15274229 ] Till Rohrmann commented on FLINK-1873: -- I think it's good to use Flink's matrix and vector representations as long as you don't have to perform operations. For those you can convert Flink's primitives into Breeze's primitives. The conversion should almost come for free, because Flink uses the same underlying representation of dense and sparse matrices/vectors as Breeze. > Distributed matrix implementation > - > > Key: FLINK-1873 > URL: https://issues.apache.org/jira/browse/FLINK-1873 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: liaoyuxi >Assignee: Simone Robutti > Labels: ML > > It would help to implement machine learning algorithms more quickly and > concisely if Flink provided support for storing data and computation in > distributed matrices. The design of the implementation is attached.
[GitHub] flink pull request: [Flink-3836] Add LongHistogram accumulator
GitHub user mbode opened a pull request:

    https://github.com/apache/flink/pull/1966

[Flink-3836] Add LongHistogram accumulator

New accumulator `LongHistogram`; the `Histogram` accumulator now throws an `IllegalArgumentException` instead of letting the int overflow.

- [x] General
  - The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text")
  - The pull request addresses only one issue
  - Each commit in the PR has a meaningful commit message (including the JIRA id)
- [x] Documentation
  - Documentation has been added for new functionality
  - Old documentation affected by the pull request has been updated
  - JavaDoc for public methods has been added
- [x] Tests & Build
  - Functionality added by the pull request is covered by tests
  - `mvn clean verify` has been executed successfully locally or a Travis build has passed

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mbode/flink master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1966.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1966

commit f457319481701a1234c9ea7d29da24f857ae4241
Author: Maximilian Bode
Date: 2016-04-27T15:19:16Z

    [Flink-3836] Add LongHistogram accumulator

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
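The overflow problem this PR addresses can be illustrated with a hedged sketch of a long-counting histogram (names are illustrative, not the PR's code): buckets count with `long`, so a heavily hit bucket cannot silently wrap around the way an `int` count would.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of a histogram accumulator that counts with long
// instead of int, so heavily hit buckets cannot silently overflow.
// Not the actual LongHistogram class from the pull request.
class LongHistogramSketch {
    private final TreeMap<Integer, Long> counts = new TreeMap<>();

    // Record one occurrence of the given bucket value.
    public void add(int value) {
        counts.merge(value, 1L, Long::sum);
    }

    public Map<Integer, Long> getLocalValue() {
        return counts;
    }

    // Combine two partial histograms, as accumulators from parallel
    // subtasks are merged into one final result.
    public void mergeWith(LongHistogramSketch other) {
        other.counts.forEach((k, v) -> counts.merge(k, v, Long::sum));
    }
}
```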
[jira] [Commented] (FLINK-1873) Distributed matrix implementation
[ https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274177#comment-15274177 ] Simone Robutti commented on FLINK-1873: --- I began working right away on this issue. For now I'm focusing on an indexed row matrix format but I will probably implement a partitioned format with the same operations to perform some operations in a more straightforward way. I will write conversions from one format to the other. For now I'm just initializing the distributed data structure and writing conversions to local formats (COO, Sparse, Dense). I'm doing everything with the standards of the local linear algebra package (indices as Int, values as Doubles, same names for methods and so on). Also I'm working with Flink's implementations of all these classes. Is it ok or should I go directly to Breeze's implementations? Then I will start thinking about common operations (multiplication, dot product, svd (?), ATA and so on). > Distributed matrix implementation > - > > Key: FLINK-1873 > URL: https://issues.apache.org/jira/browse/FLINK-1873 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: liaoyuxi >Assignee: Simone Robutti > Labels: ML > > It would help to implement machine learning algorithm more quickly and > concise if Flink would provide support for storing data and computation in > distributed matrix. The design of the implementation is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-3879) Native implementation of HITS algorithm
Greg Hogan created FLINK-3879: - Summary: Native implementation of HITS algorithm Key: FLINK-3879 URL: https://issues.apache.org/jira/browse/FLINK-3879 Project: Flink Issue Type: New Feature Components: Gelly Affects Versions: 1.1.0 Reporter: Greg Hogan Assignee: Greg Hogan Fix For: 1.1.0 Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is presented in [0] and described in [1]. [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf [1] https://en.wikipedia.org/wiki/HITS_algorithm -- This message was sent by Atlassian JIRA (v6.3.4#6332)
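The update rules from [0] can be sketched in plain Java (illustrative only, not the Gelly implementation): each iteration sets every vertex's authority score to the sum of its in-neighbors' hub scores, then each hub score to the sum of its out-neighbors' authority scores, normalizing after each step.

```java
import java.util.Arrays;

// Power-iteration form of HITS from Kleinberg's paper: authority(v) sums the
// hub scores of v's in-neighbors, hub(v) sums the authority scores of v's
// out-neighbors, with normalization each round. Plain-Java sketch.
class HitsSketch {
    // edges[i] = {source, target}; returns {hubScores, authorityScores}.
    public static double[][] run(int n, int[][] edges, int iterations) {
        double[] hub = new double[n];
        double[] auth = new double[n];
        Arrays.fill(hub, 1.0);
        for (int it = 0; it < iterations; it++) {
            Arrays.fill(auth, 0.0);
            for (int[] e : edges) auth[e[1]] += hub[e[0]];
            normalize(auth);
            Arrays.fill(hub, 0.0);
            for (int[] e : edges) hub[e[0]] += auth[e[1]];
            normalize(hub);
        }
        return new double[][] { hub, auth };
    }

    private static void normalize(double[] v) {
        double norm = 0.0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        if (norm > 0) {
            for (int i = 0; i < v.length; i++) v[i] /= norm;
        }
    }
}
```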
[jira] [Commented] (FLINK-3211) Add AWS Kinesis streaming connector
[ https://issues.apache.org/jira/browse/FLINK-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274119#comment-15274119 ] Robert Metzger commented on FLINK-3211: --- [~tzulitai] Would you be okay of publishing the kinesis connector as a github project? Maybe in the dataArtisans github, as "flink-connectors" or so? > Add AWS Kinesis streaming connector > --- > > Key: FLINK-3211 > URL: https://issues.apache.org/jira/browse/FLINK-3211 > Project: Flink > Issue Type: New Feature > Components: Streaming Connectors >Affects Versions: 1.1.0 >Reporter: Tzu-Li (Gordon) Tai >Assignee: Tzu-Li (Gordon) Tai > Original Estimate: 336h > Remaining Estimate: 336h > > AWS Kinesis is a widely adopted message queue used by AWS users, much like a > cloud service version of Apache Kafka. Support for AWS Kinesis will be a > great addition to the handful of Flink's streaming connectors to external > systems and a great reach out to the AWS community. > AWS supports two different ways to consume Kinesis data: with the low-level > AWS SDK [1], or with the high-level KCL (Kinesis Client Library) [2]. AWS SDK > can be used to consume Kinesis data, including stream read beginning from a > specific offset (or "record sequence number" in Kinesis terminology). On the > other hand, AWS officially recommends using KCL, which offers a higher-level > of abstraction that also comes with checkpointing and failure recovery by > using a KCL-managed AWS DynamoDB "leash table" as the checkpoint state > storage. > However, KCL is essentially a stream processing library that wraps all the > partition-to-task (or "shard" in Kinesis terminology) determination and > checkpointing to allow the user to focus only on streaming application logic. > This leads to the understanding that we can not use the KCL to implement the > Kinesis streaming connector if we are aiming for a deep integration of Flink > with Kinesis that provides exactly once guarantees (KCL promises only > at-least-once). 
Therefore, AWS SDK will be the way to go for the > implementation of this feature. > With the ability to read from specific offsets, and also the fact that > Kinesis and Kafka share a lot of similarities, the basic principles of the > implementation of Flink's Kinesis streaming connector will very much resemble > the Kafka connector. We can basically follow the outlines described in > [~StephanEwen]'s description [3] and [~rmetzger]'s Kafka connector > implementation [4]. A few tweaks due to some of Kinesis v.s. Kafka > differences is described as following: > 1. While the Kafka connector can support reading from multiple topics, I > currently don't think this is a good idea for Kinesis streams (a Kinesis > Stream is logically equivalent to a Kafka topic). Kinesis streams can exist > in different AWS regions, and each Kinesis stream under the same AWS user > account may have completely independent access settings with different > authorization keys. Overall, a Kinesis stream feels like a much more > consolidated resource compared to Kafka topics. It would be great to hear > more thoughts on this part. > 2. While Kafka has brokers that can hold multiple partitions, the only > partitioning abstraction for AWS Kinesis is "shards". Therefore, in contrast > to the Kafka connector having per broker connections where the connections > can handle multiple Kafka partitions, the Kinesis connector will only need to > have simple per shard connections. > 3. Kinesis itself does not support committing offsets back to Kinesis. If we > were to implement this feature like the Kafka connector with Kafka / ZK to > sync outside view of progress, we probably could use ZK or DynamoDB like the > way KCL works. More thoughts on this part will be very helpful too. > As for the Kinesis Sink, it should be possible to use the AWS KPL (Kinesis > Producer Library) [5]. 
However, for higher code consistency with the proposed > Kinesis Consumer, I think it will be better to stick with the AWS SDK for the > implementation. The implementation should be straight forward, being almost > if not completely the same as the Kafka sink. > References: > [1] > http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-sdk.html > [2] > http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-kcl.html > [3] > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Kinesis-Connector-td2872.html > [4] http://data-artisans.com/kafka-flink-a-practical-how-to/ > [5] > http://docs.aws.amazon.com//kinesis/latest/dev/developing-producers-with-kpl.html#d0e4998 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
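Point 3 above, syncing an outside view of progress, boils down to checkpointing a per-shard sequence number. A minimal simulation of that state handling (plain Java, no AWS SDK calls; all names are illustrative, not the proposed connector's API):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal simulation of the checkpoint/restore idea described above: track
// the last-emitted sequence number per shard, snapshot that map on
// checkpoint, and resume reading after those sequence numbers on restore.
// Illustrative names only; a real consumer would request an
// AFTER_SEQUENCE_NUMBER shard iterator per entry so no record is re-emitted.
class ShardOffsetTrackerSketch {
    private final Map<String, String> lastSequenceNumbers = new HashMap<>();

    public void recordEmitted(String shardId, String sequenceNumber) {
        lastSequenceNumbers.put(shardId, sequenceNumber);
    }

    // In a real connector this snapshot is taken atomically with emission.
    public Map<String, String> snapshot() {
        return new HashMap<>(lastSequenceNumbers);
    }

    public void restore(Map<String, String> state) {
        lastSequenceNumbers.clear();
        lastSequenceNumbers.putAll(state);
    }

    public String resumeFrom(String shardId) {
        return lastSequenceNumbers.get(shardId);
    }
}
```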
[jira] [Commented] (FLINK-3211) Add AWS Kinesis streaming connector
[ https://issues.apache.org/jira/browse/FLINK-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274018#comment-15274018 ] Robert Metzger commented on FLINK-3211: --- I've spent quite some time with the Kinesis connector to get it working with Flink properly, but there is one big issue: the Kinesis producer library (and the client library) both depend on protobuf-java 2.6.1, while Flink uses protobuf 2.5 (mainly forced by Akka). I tried the following approaches:
- Exclude the protobuf-java dependency from Akka --> Doesn't work because it seems to be hardwired somehow
- Upgrade to Akka 2.4.0, which doesn't depend on protobuf anymore --> doesn't work because Akka 2.4 requires Java 8 (and Scala 2.11)
- Shade the Kinesis connector's protobuf dependency into the "flink-connector-kinesis" jar --> It works, but we cannot do it like this due to legal restrictions (the Amazon Software License restricts the use of the software to AWS services, which is not compatible with the ASF).
I see the following solutions:
- We merge the Kinesis code into master as an optional module, which is not built for releases. People have to build it themselves.
- We host the Kinesis connector in a separate project (outside of the ASF).
- We ask Amazon to exclude the protobuf dependency.
- I try again using Akka without protobuf.
> Add AWS Kinesis streaming connector > --- > > Key: FLINK-3211 > URL: https://issues.apache.org/jira/browse/FLINK-3211 > Project: Flink > Issue Type: New Feature > Components: Streaming Connectors >Affects Versions: 1.1.0 >Reporter: Tzu-Li (Gordon) Tai >Assignee: Tzu-Li (Gordon) Tai > Original Estimate: 336h > Remaining Estimate: 336h > > AWS Kinesis is a widely adopted message queue used by AWS users, much like a > cloud service version of Apache Kafka. Support for AWS Kinesis will be a > great addition to the handful of Flink's streaming connectors to external > systems and a great reach out to the AWS community. 
> AWS supports two different ways to consume Kinesis data: with the low-level > AWS SDK [1], or with the high-level KCL (Kinesis Client Library) [2]. AWS SDK > can be used to consume Kinesis data, including stream read beginning from a > specific offset (or "record sequence number" in Kinesis terminology). On the > other hand, AWS officially recommends using KCL, which offers a higher-level > of abstraction that also comes with checkpointing and failure recovery by > using a KCL-managed AWS DynamoDB "leash table" as the checkpoint state > storage. > However, KCL is essentially a stream processing library that wraps all the > partition-to-task (or "shard" in Kinesis terminology) determination and > checkpointing to allow the user to focus only on streaming application logic. > This leads to the understanding that we can not use the KCL to implement the > Kinesis streaming connector if we are aiming for a deep integration of Flink > with Kinesis that provides exactly once guarantees (KCL promises only > at-least-once). Therefore, AWS SDK will be the way to go for the > implementation of this feature. > With the ability to read from specific offsets, and also the fact that > Kinesis and Kafka share a lot of similarities, the basic principles of the > implementation of Flink's Kinesis streaming connector will very much resemble > the Kafka connector. We can basically follow the outlines described in > [~StephanEwen]'s description [3] and [~rmetzger]'s Kafka connector > implementation [4]. A few tweaks due to some of Kinesis v.s. Kafka > differences is described as following: > 1. While the Kafka connector can support reading from multiple topics, I > currently don't think this is a good idea for Kinesis streams (a Kinesis > Stream is logically equivalent to a Kafka topic). Kinesis streams can exist > in different AWS regions, and each Kinesis stream under the same AWS user > account may have completely independent access settings with different > authorization keys. 
Overall, a Kinesis stream feels like a much more > consolidated resource compared to Kafka topics. It would be great to hear > more thoughts on this part. > 2. While Kafka has brokers that can hold multiple partitions, the only > partitioning abstraction for AWS Kinesis is "shards". Therefore, in contrast > to the Kafka connector having per broker connections where the connections > can handle multiple Kafka partitions, the Kinesis connector will only need to > have simple per shard connections. > 3. Kinesis itself does not support committing offsets back to Kinesis. If we > were to implement this feature like the Kafka connector with Kafka / ZK to > sync outside view of progress, we probably could use ZK or DynamoDB like the > way KCL works. More thoughts on this part will be very helpful too. > As for
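The shading approach discussed above (relocating protobuf classes inside the connector jar) would correspond to a maven-shade-plugin relocation roughly like the following sketch; the `shadedPattern` package name is purely illustrative, and as noted the blocker here is licensing, not tooling:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <!-- Rewrite protobuf-java 2.6.1 classes into a private package so
               they cannot clash with the protobuf 2.5 pulled in by Akka. -->
          <relocation>
            <pattern>com.google.protobuf</pattern>
            <shadedPattern>org.apache.flink.kinesis.shaded.com.google.protobuf</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```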
[jira] [Commented] (FLINK-3772) Graph algorithms for vertex and edge degree
[ https://issues.apache.org/jira/browse/FLINK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273993#comment-15273993 ] ASF GitHub Bot commented on FLINK-3772: --- Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1901#discussion_r62325628

--- Diff: docs/apis/batch/libs/gelly.md ---
@@ -2067,7 +2067,92 @@ configuration.
- TranslateGraphIds
+ degree.annotate.directed.VertexInDegree
+
+Annotate vertices of a directed graph with the in-degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> inDegree = graph
+  .run(new VertexInDegree()
+    .setIncludeZeroDegreeVertices(true));
+{% endhighlight %}
+
+ degree.annotate.directed.VertexOutDegree
+
+Annotate vertices of a directed graph with the out-degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> outDegree = graph
+  .run(new VertexOutDegree()
+    .setIncludeZeroDegreeVertices(true));
+{% endhighlight %}
+
+ degree.annotate.directed.VertexDegreePair
+
+Annotate vertices of a directed graph with both the out-degree and in-degree count.
+{% highlight java %}
+DataSet<Vertex<K, Tuple2<LongValue, LongValue>>> pairDegree = graph
+  .run(new VertexDegreePair()
+    .setIncludeZeroDegreeVertices(true));
+{% endhighlight %}
+
+ degree.annotate.undirected.VertexDegree
+
+Annotate vertices of an undirected graph with the degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> degree = graph
+  .run(new VertexDegree()
+    .setIncludeZeroDegreeVertices(true)
+    .setReduceOnTargetId(true));
+{% endhighlight %}
+
+ degree.annotate.undirected.EdgeSourceDegree
+
+Annotate edges of an undirected graph with degree of the source ID.
--- End diff --

An `Edge` is always directed but a `Graph` may be undirected if it contains a matching reverse of every edge. 
> Graph algorithms for vertex and edge degree > --- > > Key: FLINK-3772 > URL: https://issues.apache.org/jira/browse/FLINK-3772 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Many graph algorithms require vertices or edges to be marked with the degree. > This ticket provides algorithms for annotating > * vertex degree for undirected graphs > * vertex out-, in-, and out- and in-degree for directed graphs > * edge source, target, and source and target degree for undirected graphs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-3772] [gelly] Graph algorithms for vert...
Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1901#discussion_r62325628

--- Diff: docs/apis/batch/libs/gelly.md ---
@@ -2067,7 +2067,92 @@ configuration.
- TranslateGraphIds
+ degree.annotate.directed.VertexInDegree
+
+Annotate vertices of a directed graph with the in-degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> inDegree = graph
+  .run(new VertexInDegree()
+    .setIncludeZeroDegreeVertices(true));
+{% endhighlight %}
+
+ degree.annotate.directed.VertexOutDegree
+
+Annotate vertices of a directed graph with the out-degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> outDegree = graph
+  .run(new VertexOutDegree()
+    .setIncludeZeroDegreeVertices(true));
+{% endhighlight %}
+
+ degree.annotate.directed.VertexDegreePair
+
+Annotate vertices of a directed graph with both the out-degree and in-degree count.
+{% highlight java %}
+DataSet<Vertex<K, Tuple2<LongValue, LongValue>>> pairDegree = graph
+  .run(new VertexDegreePair()
+    .setIncludeZeroDegreeVertices(true));
+{% endhighlight %}
+
+ degree.annotate.undirected.VertexDegree
+
+Annotate vertices of an undirected graph with the degree count.
+{% highlight java %}
+DataSet<Vertex<K, LongValue>> degree = graph
+  .run(new VertexDegree()
+    .setIncludeZeroDegreeVertices(true)
+    .setReduceOnTargetId(true));
+{% endhighlight %}
+
+ degree.annotate.undirected.EdgeSourceDegree
+
+Annotate edges of an undirected graph with degree of the source ID.
--- End diff --

An `Edge` is always directed but a `Graph` may be undirected if it contains a matching reverse of every edge.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-3754) Add a validation phase before construct RelNode using TableAPI
[ https://issues.apache.org/jira/browse/FLINK-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273989#comment-15273989 ] ASF GitHub Bot commented on FLINK-3754: --- Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/1958#issuecomment-217430461 I took a brief look at the PR. The overall structure of the validation looks good to me. Will do a more in-depth review in the next days. @twalthr, can you have another look as well? Thanks, Fabian > Add a validation phase before construct RelNode using TableAPI > -- > > Key: FLINK-3754 > URL: https://issues.apache.org/jira/browse/FLINK-3754 > Project: Flink > Issue Type: Improvement > Components: Table API >Affects Versions: 1.0.0 >Reporter: Yijie Shen >Assignee: Yijie Shen > > Unlike sql string's execution, which have a separate validation phase before > RelNode construction, Table API lacks the counterparts and the validation is > scattered in many places. > I suggest to add a single validation phase and detect problems as early as > possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-3754][Table]Add a validation phase befo...
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/1958#issuecomment-217430461 I took a brief look at the PR. The overall structure of the validation looks good to me. Will do a more in-depth review in the next days. @twalthr, can you have another look as well? Thanks, Fabian --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-3772) Graph algorithms for vertex and edge degree
[ https://issues.apache.org/jira/browse/FLINK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273983#comment-15273983 ] ASF GitHub Bot commented on FLINK-3772: --- Github user greghogan commented on the pull request: https://github.com/apache/flink/pull/1901#issuecomment-21742 @vasia I had looked at one of the failed tests and it wasn't due to bad logic but rather to running out of some minimal number of memory buffers. This results, I expect, from updating the `Graph` API's degree functions to translate the output of the degree algorithms. I think it is best to leave the `Graph` API as-is for now. > Graph algorithms for vertex and edge degree > --- > > Key: FLINK-3772 > URL: https://issues.apache.org/jira/browse/FLINK-3772 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Many graph algorithms require vertices or edges to be marked with the degree. > This ticket provides algorithms for annotating > * vertex degree for undirected graphs > * vertex out-, in-, and out- and in-degree for directed graphs > * edge source, target, and source and target degree for undirected graphs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-3772] [gelly] Graph algorithms for vert...
Github user greghogan commented on the pull request: https://github.com/apache/flink/pull/1901#issuecomment-21742 @vasia I had looked at one of the failed tests and it wasn't due to bad logic but rather to running out of some minimal number of memory buffers. This results, I expect, from updating the `Graph` API's degree functions to translate the output of the degree algorithms. I think it is best to leave the `Graph` API as-is for now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-3650) Add maxBy/minBy to Scala DataSet API
[ https://issues.apache.org/jira/browse/FLINK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273945#comment-15273945 ] ASF GitHub Bot commented on FLINK-3650: --- Github user ramkrish86 commented on a diff in the pull request: https://github.com/apache/flink/pull/1856#discussion_r62320682 --- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SelectByMinFunction.java --- @@ -72,8 +73,8 @@ public T reduce(T value1, T value2) throws Exception { for (int position : fields) { // Save position of compared key // Get both values - both implement comparable - Comparable comparable1 = value1.getFieldNotNull(position); - Comparable comparable2 = value2.getFieldNotNull(position); + Comparable comparable1 = ((Tuple)value1).getFieldNotNull(position); --- End diff -- Thanks for the tip @tillrohrmann . Let me see how I can adapt the CaseClassTypeInfo for SelectByMax/Min function. > Add maxBy/minBy to Scala DataSet API > > > Key: FLINK-3650 > URL: https://issues.apache.org/jira/browse/FLINK-3650 > Project: Flink > Issue Type: Improvement > Components: Java API, Scala API >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: ramkrishna.s.vasudevan > > The stable Java DataSet API contains the API calls {{maxBy}} and {{minBy}}. > These methods are not supported by the Scala DataSet API. These methods > should be added in order to have a consistent API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: FLINK-3650 Add maxBy/minBy to Scala DataSet AP...
Github user ramkrish86 commented on a diff in the pull request: https://github.com/apache/flink/pull/1856#discussion_r62320682 --- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SelectByMinFunction.java --- @@ -72,8 +73,8 @@ public T reduce(T value1, T value2) throws Exception { for (int position : fields) { // Save position of compared key // Get both values - both implement comparable - Comparable comparable1 = value1.getFieldNotNull(position); - Comparable comparable2 = value2.getFieldNotNull(position); + Comparable comparable1 = ((Tuple)value1).getFieldNotNull(position); --- End diff -- Thanks for the tip @tillrohrmann . Let me see how I can adapt the CaseClassTypeInfo for SelectByMax/Min function. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
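The minBy semantics under discussion can be sketched independently of the Flink type system: compare two candidate records field by field, in the order the key positions were given, and keep the first record on a full tie. Plain arrays of `Comparable` stand in for Flink tuples and Scala case classes here; all names are illustrative, not the PR's code.

```java
// Sketch of SelectByMin-style reduction: walk the key positions in order and
// keep whichever record is strictly smaller on the first differing field;
// if all key fields are equal, keep the first record.
class SelectByMinSketch {
    @SuppressWarnings({"unchecked", "rawtypes"})
    public static Comparable[] reduce(Comparable[] value1, Comparable[] value2, int[] fields) {
        for (int position : fields) {
            int comparison = value1[position].compareTo(value2[position]);
            if (comparison < 0) return value1; // value1 strictly smaller: keep it
            if (comparison > 0) return value2; // value2 strictly smaller: keep it
            // equal on this field: fall through to the next key position
        }
        return value1; // equal on all key fields: keep the first record
    }
}
```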
[jira] [Commented] (FLINK-2044) Implementation of Gelly HITS Algorithm
[ https://issues.apache.org/jira/browse/FLINK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273941#comment-15273941 ] ASF GitHub Bot commented on FLINK-2044: --- Github user vasia commented on the pull request: https://github.com/apache/flink/pull/1956#issuecomment-217419465 Hi @gallenvara, @greghogan, - If it's possible to return both the hub and the authority value, I'd prefer that. - GSA iterations allow setting the edge direction as Greg suggested. I'm not sure how much of a difference the combiner would make. Also, we've seen that for some graphs scatter-gather performs better. Personally, I would be fine with a first scatter-gather version of the algorithm. We can run some tests to see whether GSA would be faster later. - I agree that edge values should be internally set to `NullValue` if not used. > Implementation of Gelly HITS Algorithm > -- > > Key: FLINK-2044 > URL: https://issues.apache.org/jira/browse/FLINK-2044 > Project: Flink > Issue Type: New Feature > Components: Gelly >Reporter: Ahamd Javid >Assignee: GaoLun >Priority: Minor > > Implementation of Hits Algorithm in Gelly API using Java. the feature branch > can be found here: (https://github.com/JavidMayar/flink/commits/HITS) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2044] [gelly] Implementation of Gelly H...
Github user vasia commented on the pull request: https://github.com/apache/flink/pull/1956#issuecomment-217419465 Hi @gallenvara, @greghogan, - If it's possible to return both the hub and the authority value, I'd prefer that. - GSA iterations allow setting the edge direction as Greg suggested. I'm not sure how much of a difference the combiner would make. Also, we've seen that for some graphs scatter-gather performs better. Personally, I would be fine with a first scatter-gather version of the algorithm. We can run some tests to see whether GSA would be faster later. - I agree that edge values should be internally set to `NullValue` if not used. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2926) Add a Strongly Connected Components Library Method
[ https://issues.apache.org/jira/browse/FLINK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273933#comment-15273933 ] Vasia Kalavri commented on FLINK-2926: -- Hi [~mliesenberg], delta iteration by default finishes when the workset is empty, but I don't see why it couldn't support a custom convergence criterion also. I thought this method was there already. In fact the [iterations guide|https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/iterations.html] states that the delta iteration supports "Custom aggregator convergence" so that's weird. Can you please open an issue for that? Thanks! > Add a Strongly Connected Components Library Method > -- > > Key: FLINK-2926 > URL: https://issues.apache.org/jira/browse/FLINK-2926 > Project: Flink > Issue Type: Task > Components: Gelly >Affects Versions: 0.10.0 >Reporter: Andra Lungu >Assignee: Martin Liesenberg >Priority: Minor > Labels: requires-design-doc > > This algorithm operates in four main steps: > 1). Form the transposed graph (each vertex sends its id to its out neighbors > which form a transposedNeighbors set) > 2). Trimming: every vertex which has only incoming or outgoing edges sets > colorID to its own value and becomes inactive. > 3). Forward traversal: >Start phase: propagate id to out neighbors >Rest phase: update the colorID with the minimum value seen > until convergence > 4). Backward traversal: > Start: if the vertex id is equal to its color id > propagate the value to transposedNeighbors > Rest: each vertex that receives a message equal to its > colorId will propagate its colorId to the transposed graph and becomes > inactive. > More info in section 3.1 of this paper: > http://ilpubs.stanford.edu:8090/1077/3/p535-salihoglu.pdf > or in section 6 of this paper: http://www.vldb.org/pvldb/vol7/p1821-yan.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
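The four-step coloring scheme in the quoted description can be sketched in plain Java, shrunk to sequential passes: propagate the minimum vertex id along forward edges until a fixpoint (steps 1 and 3), then walk the transposed edges backward from each color's root; the vertices reached within the same color form one SCC and become inactive (steps 2 and 4 folded together). This is illustrative only, not the proposed Gelly delta-iteration implementation.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

// Forward/backward coloring for strongly connected components, as in the
// Salihoglu/Widom paper cited above, written as sequential fixpoint passes.
class SccColoringSketch {
    // edges[i] = {source, target}; returns scc[v] = id of v's component root.
    public static int[] run(int n, int[][] edges) {
        int[] scc = new int[n];
        boolean[] done = new boolean[n];
        Arrays.fill(scc, -1);
        int remaining = n;
        while (remaining > 0) {
            // Forward traversal: each active vertex starts with its own id
            // and keeps the minimum id seen until convergence.
            int[] color = new int[n];
            for (int v = 0; v < n; v++) color[v] = v;
            boolean changed = true;
            while (changed) {
                changed = false;
                for (int[] e : edges) {
                    if (!done[e[0]] && !done[e[1]] && color[e[0]] < color[e[1]]) {
                        color[e[1]] = color[e[0]];
                        changed = true;
                    }
                }
            }
            // Backward traversal: from each root (vertex id == color id),
            // follow transposed edges within the same color; everything
            // reached is that root's SCC and becomes inactive.
            for (int r = 0; r < n; r++) {
                if (done[r] || color[r] != r) continue;
                ArrayDeque<Integer> queue = new ArrayDeque<>();
                queue.add(r);
                done[r] = true;
                scc[r] = r;
                remaining--;
                while (!queue.isEmpty()) {
                    int v = queue.poll();
                    for (int[] e : edges) {
                        if (e[1] == v && !done[e[0]] && color[e[0]] == r) {
                            done[e[0]] = true;
                            scc[e[0]] = r;
                            remaining--;
                            queue.add(e[0]);
                        }
                    }
                }
            }
        }
        return scc;
    }
}
```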
[GitHub] flink pull request: [FLINK-3772] [gelly] Graph algorithms for vert...
Github user vasia commented on the pull request: https://github.com/apache/flink/pull/1901#issuecomment-217403403 @greghogan, after your last commit, there are several gelly test cases failing on travis. Could you please take a look? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-3772) Graph algorithms for vertex and edge degree
[ https://issues.apache.org/jira/browse/FLINK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273868#comment-15273868 ] ASF GitHub Bot commented on FLINK-3772: --- Github user vasia commented on the pull request: https://github.com/apache/flink/pull/1901#issuecomment-217403403 @greghogan, after your last commit, there are several gelly test cases failing on travis. Could you please take a look? Thanks! > Graph algorithms for vertex and edge degree > --- > > Key: FLINK-3772 > URL: https://issues.apache.org/jira/browse/FLINK-3772 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Many graph algorithms require vertices or edges to be marked with the degree. > This ticket provides algorithms for annotating > * vertex degree for undirected graphs > * vertex out-, in-, and out- and in-degree for directed graphs > * edge source, target, and source and target degree for undirected graphs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1873) Distributed matrix implementation
[ https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273867#comment-15273867 ] Fabian Hueske commented on FLINK-1873: -- [~chobeat], I gave you contributor permissions for JIRA as well. You can now assign issues to yourself. > Distributed matrix implementation > - > > Key: FLINK-1873 > URL: https://issues.apache.org/jira/browse/FLINK-1873 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: liaoyuxi >Assignee: Simone Robutti > Labels: ML > > It would help to implement machine learning algorithm more quickly and > concise if Flink would provide support for storing data and computation in > distributed matrix. The design of the implementation is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-1873) Distributed matrix implementation
[ https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabian Hueske updated FLINK-1873: - Assignee: Simone Robutti > Distributed matrix implementation > - > > Key: FLINK-1873 > URL: https://issues.apache.org/jira/browse/FLINK-1873 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: liaoyuxi >Assignee: Simone Robutti > Labels: ML > > It would help to implement machine learning algorithm more quickly and > concise if Flink would provide support for storing data and computation in > distributed matrix. The design of the implementation is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: FLINK-3650 Add maxBy/minBy to Scala DataSet AP...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/1856#discussion_r62310212 --- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SelectByMinFunction.java --- @@ -72,8 +73,8 @@ public T reduce(T value1, T value2) throws Exception { for (int position : fields) { // Save position of compared key // Get both values - both implement comparable - Comparable comparable1 = value1.getFieldNotNull(position); - Comparable comparable2 = value2.getFieldNotNull(position); + Comparable comparable1 = ((Tuple)value1).getFieldNotNull(position); --- End diff -- The `CaseClassTypeInfo` is the type information which is created for Scala tuples, if I'm not mistaken. And all Scala tuples are of type `Product`. With that you should be able to adapt the `SelectByMax/MinFunction`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-3772] [gelly] Graph algorithms for vert...
Github user vasia commented on a diff in the pull request: https://github.com/apache/flink/pull/1901#discussion_r62310380 --- Diff: docs/apis/batch/libs/gelly.md --- @@ -2067,7 +2067,92 @@ configuration. - TranslateGraphIds + degree.annotate.directed.VertexInDegree + +Annotate vertices of a directed graph with the in-degree count. --- End diff -- What do you mean by in-degree "count"? What's the result of this operation? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: FLINK-3650 Add maxBy/minBy to Scala DataSet AP...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/1856#discussion_r62309620 --- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SelectByMinFunction.java --- @@ -72,8 +73,8 @@ public T reduce(T value1, T value2) throws Exception { for (int position : fields) { // Save position of compared key // Get both values - both implement comparable - Comparable comparable1 = value1.getFieldNotNull(position); - Comparable comparable2 = value2.getFieldNotNull(position); + Comparable comparable1 = ((Tuple)value1).getFieldNotNull(position); --- End diff -- I think the purpose of Flink-3650 is adding support for Scala tuples. Thus, I would rather drop support for Java tuples in the Scala API than for Scala tuples. I would assume that you have to implement a Scala specific `SelectByMax/MinFunction` to support Scala tuples. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-1873) Distributed matrix implementation
[ https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273854#comment-15273854 ] Simone Robutti commented on FLINK-1873: --- Sure. I will have to study a bit to do it properly but I was already going to do something like that for an algorithm I'm implementing (MinHash). > Distributed matrix implementation > - > > Key: FLINK-1873 > URL: https://issues.apache.org/jira/browse/FLINK-1873 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: liaoyuxi >Assignee: liaoyuxi > Labels: ML > > It would help to implement machine learning algorithm more quickly and > concise if Flink would provide support for storing data and computation in > distributed matrix. The design of the implementation is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3772) Graph algorithms for vertex and edge degree
[ https://issues.apache.org/jira/browse/FLINK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273860#comment-15273860 ] ASF GitHub Bot commented on FLINK-3772: --- Github user vasia commented on a diff in the pull request: https://github.com/apache/flink/pull/1901#discussion_r62310380 --- Diff: docs/apis/batch/libs/gelly.md --- @@ -2067,7 +2067,92 @@ configuration. - TranslateGraphIds + degree.annotate.directed.VertexInDegree + +Annotate vertices of a directed graph with the in-degree count. --- End diff -- What do you mean by in-degree "count"? What's the result of this operation? > Graph algorithms for vertex and edge degree > --- > > Key: FLINK-3772 > URL: https://issues.apache.org/jira/browse/FLINK-3772 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Many graph algorithms require vertices or edges to be marked with the degree. > This ticket provides algorithms for annotating > * vertex degree for undirected graphs > * vertex out-, in-, and out- and in-degree for directed graphs > * edge source, target, and source and target degree for undirected graphs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3772) Graph algorithms for vertex and edge degree
[ https://issues.apache.org/jira/browse/FLINK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273863#comment-15273863 ] ASF GitHub Bot commented on FLINK-3772: --- Github user vasia commented on a diff in the pull request: https://github.com/apache/flink/pull/1901#discussion_r62310412 --- Diff: docs/apis/batch/libs/gelly.md --- @@ -2067,7 +2067,92 @@ configuration. - TranslateGraphIds + degree.annotate.directed.VertexInDegree + +Annotate vertices of a directed graph with the in-degree count. +{% highlight java %} +DataSet<Vertex<K, LongValue>> inDegree = graph + .run(new VertexInDegree<K, VV, EV>() +.setIncludeZeroDegreeVertices(true)); +{% endhighlight %} + + + + + degree.annotate.directed.VertexOutDegree + +Annotate vertices of a directed graph with the out-degree count. +{% highlight java %} +DataSet<Vertex<K, LongValue>> outDegree = graph + .run(new VertexOutDegree<K, VV, EV>() +.setIncludeZeroDegreeVertices(true)); +{% endhighlight %} + + + + + degree.annotate.directed.VertexDegreePair + +Annotate vertices of a directed graph with both the out-degree and in-degree count. +{% highlight java %} +DataSet<Vertex<K, Tuple2<LongValue, LongValue>>> pairDegree = graph + .run(new VertexDegreePair<K, VV, EV>() +.setIncludeZeroDegreeVertices(true)); +{% endhighlight %} + + + + + degree.annotate.undirected.VertexDegree + +Annotate vertices of an undirected graph with the degree count. +{% highlight java %} +DataSet<Vertex<K, LongValue>> degree = graph + .run(new VertexDegree<K, VV, EV>() +.setIncludeZeroDegreeVertices(true) +.setReduceOnTargetId(true)); +{% endhighlight %} + + + + + degree.annotate.undirected.EdgeSourceDegree + +Annotate edges of an undirected graph with degree of the source ID. --- End diff -- When you say "undirected" graph, do you mean the algorithm does not take into account the edge direction or does the input graph have to be undirected? Since gelly graphs are in fact always directed and we simply add opposite-direction edges to represent undirected graphs, we should be clear about this. Also, which is the source vertex in an undirected edge?
> Graph algorithms for vertex and edge degree > --- > > Key: FLINK-3772 > URL: https://issues.apache.org/jira/browse/FLINK-3772 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Many graph algorithms require vertices or edges to be marked with the degree. > This ticket provides algorithms for annotating > * vertex degree for undirected graphs > * vertex out-, in-, and out- and in-degree for directed graphs > * edge source, target, and source and target degree for undirected graphs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-3772] [gelly] Graph algorithms for vert...
Github user vasia commented on a diff in the pull request: https://github.com/apache/flink/pull/1901#discussion_r62310412 --- Diff: docs/apis/batch/libs/gelly.md --- @@ -2067,7 +2067,92 @@ configuration. - TranslateGraphIds + degree.annotate.directed.VertexInDegree + +Annotate vertices of a directed graph with the in-degree count. +{% highlight java %} +DataSet<Vertex<K, LongValue>> inDegree = graph + .run(new VertexInDegree<K, VV, EV>() +.setIncludeZeroDegreeVertices(true)); +{% endhighlight %} + + + + + degree.annotate.directed.VertexOutDegree + +Annotate vertices of a directed graph with the out-degree count. +{% highlight java %} +DataSet<Vertex<K, LongValue>> outDegree = graph + .run(new VertexOutDegree<K, VV, EV>() +.setIncludeZeroDegreeVertices(true)); +{% endhighlight %} + + + + + degree.annotate.directed.VertexDegreePair + +Annotate vertices of a directed graph with both the out-degree and in-degree count. +{% highlight java %} +DataSet<Vertex<K, Tuple2<LongValue, LongValue>>> pairDegree = graph + .run(new VertexDegreePair<K, VV, EV>() +.setIncludeZeroDegreeVertices(true)); +{% endhighlight %} + + + + + degree.annotate.undirected.VertexDegree + +Annotate vertices of an undirected graph with the degree count. +{% highlight java %} +DataSet<Vertex<K, LongValue>> degree = graph + .run(new VertexDegree<K, VV, EV>() +.setIncludeZeroDegreeVertices(true) +.setReduceOnTargetId(true)); +{% endhighlight %} + + + + + degree.annotate.undirected.EdgeSourceDegree + +Annotate edges of an undirected graph with degree of the source ID. --- End diff -- When you say "undirected" graph, do you mean the algorithm does not take into account the edge direction or does the input graph have to be undirected? Since gelly graphs are in fact always directed and we simply add opposite-direction edges to represent undirected graphs, we should be clear about this. Also, which is the source vertex in an undirected edge? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-3772) Graph algorithms for vertex and edge degree
[ https://issues.apache.org/jira/browse/FLINK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273862#comment-15273862 ] ASF GitHub Bot commented on FLINK-3772: --- Github user vasia commented on a diff in the pull request: https://github.com/apache/flink/pull/1901#discussion_r62310388 --- Diff: docs/apis/batch/libs/gelly.md --- @@ -2067,7 +2067,92 @@ configuration. - TranslateGraphIds + degree.annotate.directed.VertexInDegree + +Annotate vertices of a directed graph with the in-degree count. +{% highlight java %} +DataSet<Vertex<K, LongValue>> inDegree = graph + .run(new VertexInDegree<K, VV, EV>() +.setIncludeZeroDegreeVertices(true)); --- End diff -- What does `setIncludeZeroDegreeVertices` do, and what other options are available? Could you please document all of them (for the rest of the algorithms also)? > Graph algorithms for vertex and edge degree > --- > > Key: FLINK-3772 > URL: https://issues.apache.org/jira/browse/FLINK-3772 > Project: Flink > Issue Type: New Feature > Components: Gelly >Affects Versions: 1.1.0 >Reporter: Greg Hogan >Assignee: Greg Hogan > Fix For: 1.1.0 > > > Many graph algorithms require vertices or edges to be marked with the degree. > This ticket provides algorithms for annotating > * vertex degree for undirected graphs > * vertex out-, in-, and out- and in-degree for directed graphs > * edge source, target, and source and target degree for undirected graphs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-3772] [gelly] Graph algorithms for vert...
Github user vasia commented on a diff in the pull request: https://github.com/apache/flink/pull/1901#discussion_r62310388 --- Diff: docs/apis/batch/libs/gelly.md --- @@ -2067,7 +2067,92 @@ configuration. - TranslateGraphIds + degree.annotate.directed.VertexInDegree + +Annotate vertices of a directed graph with the in-degree count. +{% highlight java %} +DataSet<Vertex<K, LongValue>> inDegree = graph + .run(new VertexInDegree<K, VV, EV>() +.setIncludeZeroDegreeVertices(true)); --- End diff -- What does `setIncludeZeroDegreeVertices` do, and what other options are available? Could you please document all of them (for the rest of the algorithms also)? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
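To make the semantics under discussion concrete, here is a minimal, plain-Java sketch of in-degree annotation with an includeZeroDegreeVertices-style switch. All names are illustrative; this is not the Gelly API, just the counting logic the documentation describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class VertexInDegreeSketch {

    // Edges are (source, target) pairs; vertices is the full vertex set.
    static Map<String, Long> inDegree(Set<String> vertices,
                                      List<String[]> edges,
                                      boolean includeZeroDegreeVertices) {
        Map<String, Long> degree = new HashMap<>();
        if (includeZeroDegreeVertices) {
            // seed every vertex with 0 so vertices without incoming edges still appear
            for (String v : vertices) {
                degree.put(v, 0L);
            }
        }
        for (String[] e : edges) {
            // each directed edge contributes 1 to the in-degree of its target
            degree.merge(e[1], 1L, Long::sum);
        }
        return degree;
    }

    public static void main(String[] args) {
        Set<String> vertices = new HashSet<>(Arrays.asList("a", "b", "c"));
        List<String[]> edges = new ArrayList<>(Arrays.asList(
                new String[]{"a", "b"},
                new String[]{"a", "c"},
                new String[]{"b", "c"}));
        // with includeZeroDegreeVertices=true, "a" appears with degree 0
        System.out.println(inDegree(vertices, edges, true));
    }
}
```

With the switch off, zero-degree vertices are simply absent from the result, which is the distinction the review comment asks to have documented.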
[jira] [Commented] (FLINK-3650) Add maxBy/minBy to Scala DataSet API
[ https://issues.apache.org/jira/browse/FLINK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273858#comment-15273858 ] ASF GitHub Bot commented on FLINK-3650: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/1856#discussion_r62310212 --- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SelectByMinFunction.java --- @@ -72,8 +73,8 @@ public T reduce(T value1, T value2) throws Exception { for (int position : fields) { // Save position of compared key // Get both values - both implement comparable - Comparable comparable1 = value1.getFieldNotNull(position); - Comparable comparable2 = value2.getFieldNotNull(position); + Comparable comparable1 = ((Tuple)value1).getFieldNotNull(position); --- End diff -- The `CaseClassTypeInfo` is the type information which is created for Scala tuples, if I'm not mistaken. And all Scala tuples are of type `Product`. With that you should be able to adapt the `SelectByMax/MinFunction`. > Add maxBy/minBy to Scala DataSet API > > > Key: FLINK-3650 > URL: https://issues.apache.org/jira/browse/FLINK-3650 > Project: Flink > Issue Type: Improvement > Components: Java API, Scala API >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: ramkrishna.s.vasudevan > > The stable Java DataSet API contains the API calls {{maxBy}} and {{minBy}}. > These methods are not supported by the Scala DataSet API. These methods > should be added in order to have a consistent API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3650) Add maxBy/minBy to Scala DataSet API
[ https://issues.apache.org/jira/browse/FLINK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273852#comment-15273852 ] ASF GitHub Bot commented on FLINK-3650: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/1856#discussion_r62309620 --- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SelectByMinFunction.java --- @@ -72,8 +73,8 @@ public T reduce(T value1, T value2) throws Exception { for (int position : fields) { // Save position of compared key // Get both values - both implement comparable - Comparable comparable1 = value1.getFieldNotNull(position); - Comparable comparable2 = value2.getFieldNotNull(position); + Comparable comparable1 = ((Tuple)value1).getFieldNotNull(position); --- End diff -- I think the purpose of Flink-3650 is adding support for Scala tuples. Thus, I would rather drop support for Java tuples in the Scala API than for Scala tuples. I would assume that you have to implement a Scala specific `SelectByMax/MinFunction` to support Scala tuples. > Add maxBy/minBy to Scala DataSet API > > > Key: FLINK-3650 > URL: https://issues.apache.org/jira/browse/FLINK-3650 > Project: Flink > Issue Type: Improvement > Components: Java API, Scala API >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: ramkrishna.s.vasudevan > > The stable Java DataSet API contains the API calls {{maxBy}} and {{minBy}}. > These methods are not supported by the Scala DataSet API. These methods > should be added in order to have a consistent API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1873) Distributed matrix implementation
[ https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273826#comment-15273826 ] Fabian Hueske commented on FLINK-1873: -- I agree, looks like the issue is abandoned. [~chobeat], if you want to work on this issue, I can assign it to you. > Distributed matrix implementation > - > > Key: FLINK-1873 > URL: https://issues.apache.org/jira/browse/FLINK-1873 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: liaoyuxi >Assignee: liaoyuxi > Labels: ML > > It would help to implement machine learning algorithm more quickly and > concise if Flink would provide support for storing data and computation in > distributed matrix. The design of the implementation is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-1502] [WIP] Basic Metric System
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/1947#issuecomment-217389510 Okay. I think we need to support multiple instances of the same job on a TaskManager. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.
[ https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273812#comment-15273812 ] ASF GitHub Bot commented on FLINK-1502: --- Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/1947#issuecomment-217389510 Okay. I think we need to support multiple instances of the same job on a TaskManager. > Expose metrics to graphite, ganglia and JMX. > > > Key: FLINK-1502 > URL: https://issues.apache.org/jira/browse/FLINK-1502 > Project: Flink > Issue Type: Sub-task > Components: JobManager, TaskManager >Affects Versions: 0.9 >Reporter: Robert Metzger >Assignee: Chesnay Schepler >Priority: Minor > Fix For: pre-apache > > > The metrics library allows to expose collected metrics easily to other > systems such as graphite, ganglia or Java's JVM (VisualVM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-3857][Streaming Connectors]Add reconnec...
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/1962#issuecomment-217373539 How did you test the code you've implemented in this pull request? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-3857) Add reconnect attempt to Elasticsearch host
[ https://issues.apache.org/jira/browse/FLINK-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273771#comment-15273771 ] ASF GitHub Bot commented on FLINK-3857: --- Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/1962#issuecomment-217373539 How did you test the code you've implemented in this pull request? > Add reconnect attempt to Elasticsearch host > --- > > Key: FLINK-3857 > URL: https://issues.apache.org/jira/browse/FLINK-3857 > Project: Flink > Issue Type: Improvement > Components: Streaming Connectors >Affects Versions: 1.1.0, 1.0.2 >Reporter: Fabian Hueske >Assignee: Subhobrata Dey > > Currently, the connection to the Elasticsearch host is opened in > {{ElasticsearchSink.open()}}. In case the connection is lost (maybe due to a > changed DNS entry), the sink fails. > I propose to catch the Exception for lost connections in the {{invoke()}} > method and try to re-open the connection for a configurable number of times > with a certain delay. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
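A rough sketch of the retry scheme the issue description proposes: catch the failure in the per-record path and re-open the connection a configurable number of times with a delay. Class and method names here are hypothetical stand-ins, not the actual ElasticsearchSink API.

```java
import java.io.IOException;

public class RetryingSink {
    private final int maxRetries;
    private final long retryDelayMillis;
    private boolean connected; // stands in for a real Elasticsearch client

    RetryingSink(int maxRetries, long retryDelayMillis) {
        this.maxRetries = maxRetries;
        this.retryDelayMillis = retryDelayMillis;
    }

    void open() throws IOException {
        // a real implementation would build the transport client here
        connected = true;
    }

    // Called once per record; re-opens the connection before giving up.
    void invoke(String record) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                if (!connected) {
                    throw new IOException("connection lost");
                }
                // a real implementation would index the record here
                return;
            } catch (IOException e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted: fail the sink as before
                }
                Thread.sleep(retryDelayMillis);
                open(); // try to re-establish the connection
            }
        }
    }
}
```

The delay between attempts matters for the DNS-change scenario mentioned in the issue, since an immediate reconnect would likely hit the same stale address.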
[jira] [Commented] (FLINK-1873) Distributed matrix implementation
[ https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273770#comment-15273770 ] Simone Robutti commented on FLINK-1873: --- This issue has been dead for a year. What about reassigning it? I think it would help implement many algorithms; right now people need to implement their own distributed operations on matrices every time. > Distributed matrix implementation > - > > Key: FLINK-1873 > URL: https://issues.apache.org/jira/browse/FLINK-1873 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: liaoyuxi >Assignee: liaoyuxi > Labels: ML > > It would help to implement machine learning algorithm more quickly and > concise if Flink would provide support for storing data and computation in > distributed matrix. The design of the implementation is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-1502] [WIP] Basic Metric System
Github user zentol commented on the pull request: https://github.com/apache/flink/pull/1947#issuecomment-217368732 @rmetzger you are correct that your job failed since the previous one wasn't cleaned up yet. Should you try to run 2 identical jobs in parallel it will fail, since the 2 jobs would use the same metrics due to name clashes. Note that in this version this also occurs when 2 operators have the same name. I have some additional functionality coming up that would allow you to circumvent this issue. @ankitcha The problem with multiple reporters is our configuration; it only supports single-line key-value pairs, and you need to know the exact key to access it. In order to configure multiple reporters you would either need a nested structure (which is not supported), or index the configuration keys (metrics.reporter.1.class) and add a new parameter containing the indices to use (e.g. metrics.reporter: 0, 1), which isn't particularly user-friendly. The metric system itself could deal with multiple reporters with minor modifications; it's all about the configuration. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.
[ https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273740#comment-15273740 ] ASF GitHub Bot commented on FLINK-1502: --- Github user zentol commented on the pull request: https://github.com/apache/flink/pull/1947#issuecomment-217368732 @rmetzger you are correct that your job failed since the previous one wasn't cleaned up yet. Should you try to run 2 identical jobs in parallel it will fail, since the 2 jobs would use the same metrics due to name clashes. Note that in this version this also occurs when 2 operators have the same name. I have some additional functionality coming up that would allow you to circumvent this issue. @ankitcha The problem with multiple reporters is our configuration; it only supports single-line key-value pairs, and you need to know the exact key to access it. In order to configure multiple reporters you would either need a nested structure (which is not supported), or index the configuration keys (metrics.reporter.1.class) and add a new parameter containing the indices to use (e.g. metrics.reporter: 0, 1), which isn't particularly user-friendly. The metric system itself could deal with multiple reporters with minor modifications; it's all about the configuration. > Expose metrics to graphite, ganglia and JMX. > > > Key: FLINK-1502 > URL: https://issues.apache.org/jira/browse/FLINK-1502 > Project: Flink > Issue Type: Sub-task > Components: JobManager, TaskManager >Affects Versions: 0.9 >Reporter: Robert Metzger >Assignee: Chesnay Schepler >Priority: Minor > Fix For: pre-apache > > > The metrics library allows to expose collected metrics easily to other > systems such as graphite, ganglia or Java's JVM (VisualVM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
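The indexed-key workaround described in the comment above could be resolved from a flat key-value configuration roughly as follows. The key names are taken from the comment's example and are hypothetical, not a released Flink configuration format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ReporterConfig {

    // Resolves per-reporter classes from a flat, single-line key-value config:
    //   metrics.reporter = "0, 1"           (which indices are active)
    //   metrics.reporter.<i>.class = "..."  (one class per index)
    static List<String> reporterClasses(Map<String, String> config) {
        String indices = config.getOrDefault("metrics.reporter", "");
        List<String> classes = new ArrayList<>();
        for (String idx : indices.split(",")) {
            idx = idx.trim();
            if (idx.isEmpty()) {
                continue;
            }
            String cls = config.get("metrics.reporter." + idx + ".class");
            if (cls != null) {
                classes.add(cls); // skip indices with no class configured
            }
        }
        return classes;
    }
}
```

This shows why the scheme works with a flat store but is clumsy for users: the index list and the indexed keys must be kept in sync by hand.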
[jira] [Commented] (FLINK-3650) Add maxBy/minBy to Scala DataSet API
[ https://issues.apache.org/jira/browse/FLINK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273738#comment-15273738 ] ASF GitHub Bot commented on FLINK-3650: --- Github user ramkrish86 commented on a diff in the pull request: https://github.com/apache/flink/pull/1856#discussion_r62296025 --- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SelectByMinFunction.java --- @@ -72,8 +73,8 @@ public T reduce(T value1, T value2) throws Exception { for (int position : fields) { // Save position of compared key // Get both values - both implement comparable - Comparable comparable1 = value1.getFieldNotNull(position); - Comparable comparable2 = value2.getFieldNotNull(position); + Comparable comparable1 = ((Tuple)value1).getFieldNotNull(position); --- End diff -- @tillrohrmann Any suggestions here? Should we handle the Scala Tuple in another JIRA? I am not an expert in Scala and have just started working with it. So if it can be handled in another JIRA, this PR can be integrated. > Add maxBy/minBy to Scala DataSet API > > > Key: FLINK-3650 > URL: https://issues.apache.org/jira/browse/FLINK-3650 > Project: Flink > Issue Type: Improvement > Components: Java API, Scala API >Affects Versions: 1.1.0 >Reporter: Till Rohrmann >Assignee: ramkrishna.s.vasudevan > > The stable Java DataSet API contains the API calls {{maxBy}} and {{minBy}}. > These methods are not supported by the Scala DataSet API. These methods > should be added in order to have a consistent API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: FLINK-3650 Add maxBy/minBy to Scala DataSet AP...
Github user ramkrish86 commented on a diff in the pull request: https://github.com/apache/flink/pull/1856#discussion_r62296025 --- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SelectByMinFunction.java --- @@ -72,8 +73,8 @@ public T reduce(T value1, T value2) throws Exception { for (int position : fields) { // Save position of compared key // Get both values - both implement comparable - Comparable comparable1 = value1.getFieldNotNull(position); - Comparable comparable2 = value2.getFieldNotNull(position); + Comparable comparable1 = ((Tuple)value1).getFieldNotNull(position); --- End diff -- @tillrohrmann Any suggestions here? Should we handle the Scala Tuple in another JIRA? I am not an expert in Scala and have just started working with it. So if it can be handled in another JIRA, this PR can be integrated. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
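Independent of the Java Tuple versus Scala Product question debated above, the reduce semantics of SelectByMinFunction can be sketched as follows. This is an illustrative, homogeneous-field version (real tuples are heterogeneous), not the Flink implementation.

```java
import java.util.List;

public class SelectByMinSketch {

    // Returns the record that is smaller on the first differing key field.
    // Fields are compared in the priority order given; value1 wins all ties.
    static <T extends Comparable<T>> List<T> reduce(List<T> value1, List<T> value2, int[] fields) {
        for (int position : fields) {
            int cmp = value1.get(position).compareTo(value2.get(position));
            if (cmp < 0) {
                return value1;
            }
            if (cmp > 0) {
                return value2;
            }
            // equal on this field: fall through to the next key field
        }
        return value1; // all key fields equal: keep the first record
    }
}
```

A Product-based variant would differ only in the field accessor (productElement instead of positional list access), which is the adaptation suggested in the review thread.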