[jira] [Created] (FLINK-1632) Use DataSet's count() and collect() to simplify Gelly methods
Vasia Kalavri created FLINK-1632: Summary: Use DataSet's count() and collect() to simplify Gelly methods Key: FLINK-1632 URL: https://issues.apache.org/jira/browse/FLINK-1632 Project: Flink Issue Type: Improvement Components: Gelly Affects Versions: 0.9 Reporter: Vasia Kalavri The recently introduced count() and collect() methods of DataSet can be used to simplify several Gelly methods. We can get rid of GraphUtils and also simplify methods which return DataSets with single values, such as numberOfVertices() and isWeaklyConnected(). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1633) Add getTriplets() Gelly method
Vasia Kalavri created FLINK-1633: Summary: Add getTriplets() Gelly method Key: FLINK-1633 URL: https://issues.apache.org/jira/browse/FLINK-1633 Project: Flink Issue Type: New Feature Components: Gelly Affects Versions: 0.9 Reporter: Vasia Kalavri Priority: Minor In some graph algorithms, it is required to access the graph edges together with the vertex values of the source and target vertices. For example, several graph weighting schemes compute some kind of similarity weights for edges, based on the attributes of the source and target vertices. This issue proposes adding a convenience Gelly method that generates a DataSet of srcVertex, Edge, TrgVertex triplets from the input graph. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
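What such a method would compute can be sketched in plain Java (the Triplet type and field names here are hypothetical, not the eventual Gelly API): each edge is joined with the values of its source and target vertices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TripletSketch {
    // Hypothetical stand-in for the proposed Gelly triplet type.
    static class Triplet {
        final long srcId, trgId;
        final double srcValue, trgValue, edgeValue;
        Triplet(long srcId, long trgId, double srcValue, double trgValue, double edgeValue) {
            this.srcId = srcId; this.trgId = trgId;
            this.srcValue = srcValue; this.trgValue = trgValue; this.edgeValue = edgeValue;
        }
    }

    // Join each edge {src, trg, value} with the values of its endpoints;
    // in Gelly this would be two joins of the edge DataSet with the vertex DataSet.
    static List<Triplet> getTriplets(Map<Long, Double> vertexValues, List<double[]> edges) {
        List<Triplet> triplets = new ArrayList<>();
        for (double[] e : edges) {
            long src = (long) e[0], trg = (long) e[1];
            triplets.add(new Triplet(src, trg, vertexValues.get(src), vertexValues.get(trg), e[2]));
        }
        return triplets;
    }
}
```

An edge-weighting scheme can then derive a similarity weight from srcValue and trgValue without writing the joins by hand in every algorithm.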
Access flink-conf.yaml data
Hi, Can someone help me on how to access the flink-conf.yaml configuration values inside the flink sources? Are these readily available as a map somewhere? Thanks.
[jira] [Created] (FLINK-1634) Fix Could not build up connection to JobManager issue on some systems
Dulaj Viduranga created FLINK-1634: -- Summary: Fix Could not build up connection to JobManager issue on some systems Key: FLINK-1634 URL: https://issues.apache.org/jira/browse/FLINK-1634 Project: Flink Issue Type: Bug Affects Versions: 0.9 Reporter: Dulaj Viduranga Priority: Critical Fix For: 0.9 In some systems, flink 0.9-SNAPSHOT gives an error (org.apache.flink.client.program.ProgramInvocationException: Could not build up connection to JobManager.) when the taskmanager tries to connect to the jobmanager. This is because the taskmanager cannot resolve the IP where the jobmanager runs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Access flink-conf.yaml data
I think that you can use `org.apache.flink.configuration.GlobalConfiguration` to obtain the configuration object. Regards. Chiwan Park (Sent with iPhone) On Mar 3, 2015, at 12:17 PM, Dulaj Viduranga vidura...@icloud.com wrote: Hi, Can someone help me on how to access the flink-conf.yaml configuration values inside the flink sources? Are these readily available as a map somewhere? Thanks.

Re: Could not build up connection to JobManager
Hi, I found the fix for this issue and I'll create a pull request in the following day.
Re: Problem mvn install
Hi Matthias, I just checked and could not reproduce the error. The files that Maven RAT complained about do not exist in Flink's master branch. I don't think they are put there as part of the build process. Best, Fabian 2015-03-02 15:09 GMT+01:00 Matthias J. Sax mj...@informatik.hu-berlin.de: Hi, if I start mvn -Dmaven.test.skip=true clean install, the goal fails and I get the following error: Unapproved licenses: flink-clients/bin/src/main/resources/web-docs/js/dagre-d3.js flink-clients/bin/src/main/resources/web-docs/js/d3.js flink-staging/flink-avro/bin/src/test/resources/avro/user.avsc It seems that an APL2-compatible license is expected. Can anyone comment on this issue? - I am on branch flink/master -Matthias
[jira] [Created] (FLINK-1624) Build of old sources fails due to git-commit-id plugin
Max Michels created FLINK-1624: -- Summary: Build of old sources fails due to git-commit-id plugin Key: FLINK-1624 URL: https://issues.apache.org/jira/browse/FLINK-1624 Project: Flink Issue Type: Bug Reporter: Max Michels Assignee: Max Michels Priority: Minor Fix For: 0.6-incubating Builds for Flink (Stratosphere) versions 0.6.0 fail because of a bug in the maven git-commit-id plugin. https://github.com/ktoso/maven-git-commit-id-plugin/issues/61 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1623) Rename Expression API and Operation Representation
Aljoscha Krettek created FLINK-1623: --- Summary: Rename Expression API and Operation Representation Key: FLINK-1623 URL: https://issues.apache.org/jira/browse/FLINK-1623 Project: Flink Issue Type: Improvement Affects Versions: 0.9 Reporter: Aljoscha Krettek Right now the package is called flink-expressions and we refer to the API as the Expression API. The equivalent to DataSet and DataStream is the ExpressionOperation. I'm not very happy with these names, so we should find something that is more marketable before making any big announcements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1618) Add parallel time discretisation for time-window transformations
Gyula Fora created FLINK-1618: - Summary: Add parallel time discretisation for time-window transformations Key: FLINK-1618 URL: https://issues.apache.org/jira/browse/FLINK-1618 Project: Flink Issue Type: Improvement Components: Streaming Reporter: Gyula Fora Currently, discretizers for all windowing policies, including time, are executed with parallelism 1 when they define global windows (for instance: the sum of the last 10 minutes). While this is necessary for arbitrary policies such as delta-based or user-defined policies, some discretizers, such as time, can be implemented in a distributed fashion. Distributed time discretizers (and other types) can be implemented in the following way:
- The discretizers should create StreamWindows with incrementally increasing IDs, starting from the same value, so that it is possible to merge them after the transformation.
- The partitioner for each discretizer should send the number of partitions created to the merger (the merger should be aware of the number of partitioners present, in order to wait for all the information).
- Based on all the partitioning info, the merger can merge the windows properly afterwards.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
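The merge step described in the ticket can be sketched in plain Java (types and names here are illustrative, not Flink's actual StreamWindow classes): parts carrying the same incrementally assigned window ID belong to the same logical window, and a window is emitted once parts from all partitions have arrived.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WindowMerger {
    // One partial window produced by a parallel discretizer instance.
    static class Part {
        final int windowId, partition;
        final List<Integer> elements;
        Part(int windowId, int partition, List<Integer> elements) {
            this.windowId = windowId; this.partition = partition; this.elements = elements;
        }
    }

    // Merge partial windows by ID; a window is complete (and emitted) once
    // parts from all numPartitions partitions have been seen for that ID.
    static Map<Integer, List<Integer>> merge(List<Part> parts, int numPartitions) {
        Map<Integer, List<Integer>> buffered = new HashMap<>();
        Map<Integer, Integer> seen = new HashMap<>();
        Map<Integer, List<Integer>> complete = new LinkedHashMap<>();
        for (Part p : parts) {
            buffered.computeIfAbsent(p.windowId, k -> new ArrayList<>()).addAll(p.elements);
            int n = seen.merge(p.windowId, 1, Integer::sum);
            if (n == numPartitions) {
                complete.put(p.windowId, buffered.remove(p.windowId));
            }
        }
        return complete;
    }
}
```

In a real implementation the per-window partition count would come from the discretizers themselves, as the ticket proposes, rather than being a fixed parameter.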
[jira] [Created] (FLINK-1620) Add pre-aggregator for count windows
Gyula Fora created FLINK-1620: - Summary: Add pre-aggregator for count windows Key: FLINK-1620 URL: https://issues.apache.org/jira/browse/FLINK-1620 Project: Flink Issue Type: Improvement Components: Streaming Reporter: Gyula Fora Currently there is only support for pre-aggregators for tumbling policies. A pre-aggregator should be added for count policies. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
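A pre-aggregator for count windows could work roughly like this (an illustrative sketch, not Flink's pre-aggregation API): instead of buffering every element of a window, fold each record into a running aggregate and emit it whenever the window size is reached.

```java
import java.util.ArrayList;
import java.util.List;

public class CountPreAggregator {
    // Emit one partial sum per complete count window of the given size,
    // keeping O(1) state instead of buffering windowSize elements.
    static List<Integer> preAggregateSums(List<Integer> input, int windowSize) {
        List<Integer> partials = new ArrayList<>();
        int sum = 0, count = 0;
        for (int value : input) {
            sum += value;
            if (++count == windowSize) {  // window complete: emit and reset
                partials.add(sum);
                sum = 0;
                count = 0;
            }
        }
        return partials;
    }
}
```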
Re: Could not build up connection to JobManager
Calling: java -cp ../examples/flink-java-examples-0.9-SNAPSHOT-KMeans.jar org.apache.flink.examples.java.clustering.util.KMeansDataGenerator 500 10 0.08 will not connect to Flink. It's just running a standalone KMeans data generator, not KMeans. I would suspect that the KMeans example is not running either. You can run the KMeans example like this: bin/flink run ./examples/flink-java-examples-0.9-SNAPSHOT-KMeans.jar. On Sat, Feb 28, 2015 at 5:47 AM, Dulaj Viduranga vidura...@icloud.com wrote: Hi, I’m thinking I’m doing something wrong. After setting jobManager address to 127.0.0.1, I can run the kmeans example (java -cp ../examples/flink-java-examples-0.9-SNAPSHOT-KMeans.jar org.apache.flink.examples.java.clustering.util.KMeansDataGenerator 500 10 0.08) But I can’t run the word count example (bin/flink run ./examples/flink-java-examples-0.9-SNAPSHOT-WordCount.jar file:'///Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/hamlet.txt' file:'///Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/count.txt’) I’m not sure whether I’m running it wrong On Feb 26, 2015, at 9:03 PM, Dulaj Viduranga vidura...@icloud.com wrote: Hi, It’s great to help out. :) Setting 127.0.0.1 instead of “localhost” in jobmanager.rpc.address helped to build the connection to the jobmanager. Apparently localhost resolving is different in the webclient and the jobmanager. I think it’s good to set jobmanager.rpc.address: 127.0.0.1 in future builds. But then I get this error when I tried to run examples. I don’t know if I should move this issue to another thread. If so please tell me.
bin/flink run /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/examples/flink-java-examples-0.9-SNAPSHOT-WordCount.jar /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/hamlet.txt $FLINK_DIRECTORY/count 20:46:21,998 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 02/26/2015 20:46:23 Job execution switched to status RUNNING. 02/26/2015 20:46:23 CHAIN DataSource (at getTextDataSet(WordCount.java:141) (org.apache.flink.api.java.io.TextInputFormat)) - FlatMap (FlatMap at main(WordCount.java:69)) - Combine(SUM(1), at main(WordCount.java:72)(1/1) switched to SCHEDULED 02/26/2015 20:46:23 CHAIN DataSource (at getTextDataSet(WordCount.java:141) (org.apache.flink.api.java.io.TextInputFormat)) - FlatMap (FlatMap at main(WordCount.java:69)) - Combine(SUM(1), at main(WordCount.java:72)(1/1) switched to DEPLOYING 02/26/2015 20:48:03 CHAIN DataSource (at getTextDataSet(WordCount.java:141) (org.apache.flink.api.java.io.TextInputFormat)) - FlatMap (FlatMap at main(WordCount.java:69)) - Combine(SUM(1), at main(WordCount.java:72)(1/1) switched to FAILED akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/taskmanager#-1628133761]] after [10 ms] at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333) at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467) at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419) at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423) at 
akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) at java.lang.Thread.run(Thread.java:745) 02/26/2015 20:48:03 Job execution switched to status FAILING. 02/26/2015 20:48:03 Reduce (SUM(1), at main(WordCount.java:72)(1/1) switched to CANCELED 02/26/2015 20:48:03 DataSink(CsvOutputFormat (path: /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/count, delimiter: ))(1/1) switched to CANCELED 02/26/2015 20:48:03 Job execution switched to status FAILED. org.apache.flink.client.program.ProgramInvocationException: The program execution failed. at org.apache.flink.client.program.Client.run(Client.java:344) at org.apache.flink.client.program.Client.run(Client.java:306) at org.apache.flink.client.program.Client.run(Client.java:300) at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55) at org.apache.flink.examples.java.wordcount.WordCount.main(WordCount.java:82) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
[DISCUSS] Make a release to be announced at ApacheCon
Hi all! ApacheCon is coming up and it is the 15th anniversary of the Apache Software Foundation. In the course of the conference, Apache would like to make a series of announcements. If we manage to make a release during (or shortly before) ApacheCon, they will announce it through their channels. I am very much in favor of doing this, under the strong condition that we are very confident that the master has grown to be stable enough (there are major changes in the distributed runtime since version 0.8 that we are still stabilizing). No use in a widely announced build that does not have the quality. Flink now has many new features that warrant a release soon (once we have fixed the last quirks in the new distributed runtime). Notable new features are: - Gelly - Streaming windows - Flink on Tez - Expression API - Distributed runtime on Akka - Batch mode - Maybe even a first ML library version - Some streaming fault tolerance Robert proposed a feature freeze in mid-March for that. His corner points were: Feature freeze (forking off release-0.9): March 17 RC1 vote: March 24 The RC1 vote is 20 days before ApacheCon (April 13). For the last three releases, the average voting time was 20 days: R 0.8.0 -- 14 days R 0.7.0 -- 22 days R 0.6 -- 26 days Please share your opinion on this! Greetings, Stephan
Re: [DISCUSS] Make a release to be announced at ApacheCon
Hey, We have a nice list of new features - it definitely makes sense to have that as a release. On my side I really want to have a first limited version of streaming fault tolerance in it. +1 for Robert's proposal for the deadlines. I'm also volunteering for release manager. Best, Marton On Mon, Mar 2, 2015 at 2:03 PM, Stephan Ewen se...@apache.org wrote: [...]
Re: Could not build up connection to JobManager
In some places in the code, localhost is hard-coded. When it is resolved by DNS, it is possible to be directed to an IP other than 127.0.0.1 (like the private range 10.0.0.0/8). I changed those places to 127.0.0.1 and it works like a charm. But hard-coding 127.0.0.1 is not a good option, because when the jobmanager IP changes, this becomes an issue again. I'm thinking of setting the jobmanager IP from the config.yaml in these places. If you have a better idea on doing this with your experience, please let me know. Best.
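The proposed change can be sketched as follows, with plain java.util.Properties standing in for Flink's configuration object (the key name is the one used in flink-conf.yaml): look the address up in the configuration and fall back to the loopback address only when nothing is configured.

```java
import java.util.Properties;

public class JobManagerAddress {
    // Resolve the JobManager address from configuration instead of a
    // hard-coded "localhost", which DNS may resolve to a different IP.
    static String resolve(Properties conf) {
        return conf.getProperty("jobmanager.rpc.address", "127.0.0.1");
    }
}
```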
[jira] [Created] (FLINK-1622) Add ReducePartial and GroupReducePartial Operators
Aljoscha Krettek created FLINK-1622: --- Summary: Add ReducePartial and GroupReducePartial Operators Key: FLINK-1622 URL: https://issues.apache.org/jira/browse/FLINK-1622 Project: Flink Issue Type: Improvement Affects Versions: 0.9 Reporter: Aljoscha Krettek This does what a Reduce or GroupReduce Operator does, except it is only performed on a local partition. This is also similar to an explicit combine that can output a type that is different from the input. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
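The difference to a plain combine can be sketched in plain Java (names are hypothetical): a partial group-reduce runs independently on each local partition, without a shuffle, and unlike Flink's combiners at the time it may emit a type different from its input.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupReducePartial {
    // Per-partition word counting: the input type is String, the output type
    // is different (word plus count). Each partition is reduced locally;
    // no data is exchanged between partitions.
    static List<Map.Entry<String, Integer>> reducePartition(List<String> partition) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : partition) {
            counts.merge(word, 1, Integer::sum);
        }
        return new ArrayList<>(counts.entrySet());
    }
}
```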
Re: Could not build up connection to JobManager
Wow, great. Can you tell us what the issue was? On 02.03.2015 09:31, Dulaj Viduranga vidura...@icloud.com wrote: Hi, I found the fix for this issue and I'll create a pull request in the following day.
[jira] [Created] (FLINK-1621) Create a generalized combine function
Max Michels created FLINK-1621: -- Summary: Create a generalized combine function Key: FLINK-1621 URL: https://issues.apache.org/jira/browse/FLINK-1621 Project: Flink Issue Type: Improvement Components: Distributed Runtime Affects Versions: 0.9 Reporter: Max Michels Fix For: 0.9 Flink allows combiners which accept a type {{I}} and combine the values of this type into type {{O}}. In Google Dataflow, combiners are more generalized. They accept an Input {{I}}, produce an intermediate combine value of {{T}}, and finally an output {{O}}. Flink's combiners are like the {{SimpleCombineFn}} in Google Dataflow. Right now, we translate the {{KeyedCombineFn}} into a {{SortPartition}} followed by a {{MapPartition}} to emulate the Combiner's behavior. Rudimentary performance tests showed that this behavior causes a significant increase in run time compared to the proper Combine implementation. Let's implement a more generalized Combiner to create a better mapping from Google Dataflow to Flink. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
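The generalization can be illustrated with a hypothetical interface modeled on Google Dataflow's CombineFn (this is a sketch, not Flink's or Dataflow's actual API): input I, intermediate accumulator A, output O, where Flink's combiners at the time correspond to the special case A == O.

```java
import java.util.List;

public class CombineSketch {
    // Hypothetical generalized combiner: input I, intermediate accumulator A,
    // output O, in the style of Google Dataflow's CombineFn.
    interface CombineFn<I, A, O> {
        A createAccumulator();
        A addInput(A acc, I input);
        A mergeAccumulators(List<A> accs);
        O extractOutput(A acc);
    }

    // Example: averaging ints needs an intermediate (sum, count) pair that is
    // neither the input type nor the output type.
    static final CombineFn<Integer, long[], Double> AVG =
            new CombineFn<Integer, long[], Double>() {
        public long[] createAccumulator() { return new long[] {0, 0}; }
        public long[] addInput(long[] acc, Integer input) {
            acc[0] += input; acc[1]++; return acc;
        }
        public long[] mergeAccumulators(List<long[]> accs) {
            long[] merged = createAccumulator();
            for (long[] a : accs) { merged[0] += a[0]; merged[1] += a[1]; }
            return merged;
        }
        public Double extractOutput(long[] acc) { return (double) acc[0] / acc[1]; }
    };
}
```

Partial accumulators can be merged across partitions before extracting the output, which is exactly what the SortPartition-plus-MapPartition emulation cannot express efficiently.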
[jira] [Created] (FLINK-1619) Add pre-aggregator for time windows
Gyula Fora created FLINK-1619: - Summary: Add pre-aggregator for time windows Key: FLINK-1619 URL: https://issues.apache.org/jira/browse/FLINK-1619 Project: Flink Issue Type: Improvement Components: Streaming Reporter: Gyula Fora Currently there is only support for pre-aggregators for tumbling policies. A pre-aggregator should be added for time policies. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Tweets Custom Input Format
Great. Thank you. I gave some feedback in the pull request and asked some questions there. On Fri, Feb 27, 2015 at 5:43 PM, Mustafa Elbehery elbeherymust...@gmail.com wrote: @robert, I have created the PR https://github.com/apache/flink/pull/442, On Fri, Feb 27, 2015 at 11:58 AM, Mustafa Elbehery elbeherymust...@gmail.com wrote: @Robert, Thanks, I was asking about the procedure. I have opened a Jira ticket for Flink-Contrib and I will create a PR with the naming convention on Wiki, https://issues.apache.org/jira/browse/FLINK-1615, On Fri, Feb 27, 2015 at 11:55 AM, Robert Metzger rmetz...@apache.org wrote: I'm glad you've found the how-to-contribute guide. I cannot describe the process to open a pull request better than it is already written in the guide. Maybe this link is also helpful for you: https://help.github.com/articles/creating-a-pull-request/ Are you facing a particular error message? Maybe that helps me to help you better. On Fri, Feb 27, 2015 at 10:46 AM, Mustafa Elbehery elbeherymust...@gmail.com wrote: Actually, I am reading the how-to-contribute guide now to push the code. It's working and tested locally and on the cluster, and I have used it for an ETL. The structure is as follows: Java POJOs for the tweet object and the nested objects, a parser class using an event-driven approach, and the SimpleTweetInputFormat itself. Would you guide me on how to push the code, just to save some time :) On Fri, Feb 27, 2015 at 10:42 AM, Robert Metzger rmetz...@apache.org wrote: Hi, cool! Can you generalize the input format to read JSON into an arbitrary POJO? It would be great if you could contribute the InputFormat into the flink-contrib module. I've seen many users reading JSON data with Flink, so it's good to have a standard solution for that. If you want, you can add the Tweet POJO as an example into flink-contrib.
On Fri, Feb 27, 2015 at 10:37 AM, Mustafa Elbehery elbeherymust...@gmail.com wrote: Hi, I am really sorry for being so late; it was a whole month of projects and examinations, I was really busy. @Robert, it is an IF for reading tweets into a POJO. I use an event-driven parser and retrieve most of the tweet into Java POJOs; it was tested on a 1TB dataset for a Flink ETL job, and the performance was pretty good. On Sun, Jan 25, 2015 at 7:38 PM, Robert Metzger rmetz...@apache.org wrote: Hey, is it an input format for reading JSON data or an IF for reading tweets in some format into a POJO? I think a JSON input format would be something very useful for our users. Maybe you can add that and use the Tweet IF as a concrete example for it? Do you have a preview of the code somewhere? Best, Robert On Sat, Jan 24, 2015 at 11:06 AM, Fabian Hueske fhue...@gmail.com wrote: Hi Mustafa, that would be a nice contribution! We are currently discussing how to add non-core API features into Flink [1]. I will move this discussion onto the mailing list to decide where to add cool add-ons like yours. Cheers, Fabian [1] https://issues.apache.org/jira/browse/FLINK-1398 2015-01-23 20:42 GMT+01:00 Henry Saputra henry.sapu...@gmail.com : Contributions are welcome! Here is the link on how to contribute to Apache Flink: http://flink.apache.org/how-to-contribute.html You can start by creating a JIRA ticket [1] to help describe what you want to do and to get feedback from the community. - Henry [1] https://issues.apache.org/jira/secure/Dashboard.jspa On Fri, Jan 23, 2015 at 10:54 AM, Mustafa Elbehery elbeherymust...@gmail.com wrote: Hi, I have created a custom InputFormat for tweets on Flink, based on the JSON-Simple event-driven parser. I would like to contribute my work to Flink. Regards.
-- Mustafa Elbehery EIT ICT Labs Master School http://www.masterschool.eitictlabs.eu/home/ +49(0)15218676094 skype: mustafaelbehery87
Re: [DISCUSS] Offer Flink with Scala 2.11
+1 I also like it. We just have to figure out how we can publish two sets of release artifacts. On Mon, Mar 2, 2015 at 4:48 PM, Stephan Ewen se...@apache.org wrote: Big +1 from my side! Does it have to be a Maven profile, or does a Maven property work? (A profile may be needed for the quasiquotes dependency?) On Mon, Mar 2, 2015 at 4:36 PM, Alexander Alexandrov alexander.s.alexand...@gmail.com wrote: Hi there, since I'm relying on Scala 2.11.4 on a project I've been working on, I created a branch which updates the Scala version used by Flink from 2.10.4 to 2.11.4: https://github.com/stratosphere/flink/commits/scala_2.11 Everything seems to work fine and the PR contains minor changes compared to Spark: https://issues.apache.org/jira/browse/SPARK-4466 If you're interested, I can rewrite this as a Maven profile and open a PR so people can build Flink with 2.11 support. I suggest doing this sooner rather than later, in order to: * keep the number of code changes enforced by the migration small and tractable; * discourage the use of deprecated or 2.11-incompatible source code in future commits. Regards, A.
Re: Problem mvn install
Matthias! The files should not exist. Has some IDE setup copied the files into the bin directory (as part of compiling it without Maven)? It looks like you are not really building it through Maven... BTW: Does it make a difference whether you use mvn -Dmaven.test.skip=true clean install or mvn -DskipTests clean install? Stephan On Mon, Mar 2, 2015 at 3:38 PM, Fabian Hueske fhue...@gmail.com wrote: [...]
Re: [DISCUSS] Offer Flink with Scala 2.11
Spark currently only provides pre-builds for 2.10 and requires a custom build for 2.11. Not sure whether this is the best idea, but I can see the benefits from a project management point of view... Would you prefer to have a {scala_version} × {hadoop_version} matrix integrated on the website? 2015-03-02 16:57 GMT+01:00 Aljoscha Krettek aljos...@apache.org: [...]
Re: Thoughts About Object Reuse and Collection Execution
On Mon, Mar 2, 2015 at 5:17 PM, Stephan Ewen se...@apache.org wrote: There are two execution modes for the runtime: reuse and non-reuse. That makes a fair bit of sense.
Re: Problem mvn install
I guess Eclipse created those files. I deleted them manually, which resolved the problem. I had added bin to my local .gitignore, and thus git status did not list the files and I was not aware that they are not part of the repository. As far as I know, -Dmaven.test.skip=true is equal to -DskipTests. -Matthias On 03/02/2015 04:54 PM, Stephan Ewen wrote: [...]
Re: Queries regarding RDFs with Flink
Hey Santosh! RDF processing often involves either joins or graph-query-like (transitive) operations. Flink is fairly good at both types of operations. I would look into the graph examples and the graph API for a start: - Graph examples: https://github.com/apache/flink/tree/master/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/graph - Graph API: https://github.com/apache/flink/tree/master/flink-staging/flink-gelly/src/main/java/org/apache/flink/graph If you have a more specific question, I can give you better pointers ;-) Stephan On Fri, Feb 27, 2015 at 4:48 PM, santosh_rajaguru sani...@gmail.com wrote: Hello, how can flink be useful for processing the data to RDFs and build the ontology? Regards, Santosh -- View this message in context: http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/Queries-regarding-RDFs-with-Flink-tp4130.html Sent from the Apache Flink (Incubator) Mailing List archive at Nabble.com.
[jira] [Created] (FLINK-1625) Add cancel method to user defined sources and sinks and call them on task cancellation
Gyula Fora created FLINK-1625: - Summary: Add cancel method to user defined sources and sinks and call them on task cancellation Key: FLINK-1625 URL: https://issues.apache.org/jira/browse/FLINK-1625 Project: Flink Issue Type: Improvement Components: Streaming Reporter: Gyula Fora Currently on task cancellation the user defined functions get interrupted without notice. This can cause serious problems for functions that have established connection with the outside world, for instance message queue connectors, file sources etc. An explicit cancel() method should be added to the SourceFunction and SinkFunction interfaces so that the user would be forced to implement the cancel functionality which is necessary for the specific udf. The cancel() method in the StreamVertex should also be implemented in a way that it calls the cancel methods of the Sink and Source functions on cancellation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
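A sketch of the proposed contract (class and method names are illustrative, not the final API): the source loops on a volatile flag that cancel() clears, so it can exit cleanly and release external resources instead of being interrupted mid-call.

```java
import java.util.ArrayList;
import java.util.List;

public class CancellableSource {
    // Illustrative source with an explicit cancel() hook, as proposed:
    // the runtime calls cancel() on task cancellation and the UDF leaves
    // its emit loop cleanly, closing any external connections it holds.
    static class CountingSource {
        private volatile boolean running = true;
        final List<Long> emitted = new ArrayList<>();

        void run(long maxRecords) {
            long i = 0;
            while (running && i < maxRecords) {
                emitted.add(i++);  // emit the next record
            }
            // cleanup would go here, e.g. closing a message queue connection
        }

        void cancel() {  // invoked by the runtime on task cancellation
            running = false;
        }
    }
}
```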
Re: Thoughts About Object Reuse and Collection Execution
@Ted Here is a bit of background about how things are currently done in the Flink runtime: There are two execution modes for the runtime: reuse and non-reuse.
- The non-reuse mode will create new objects for every record received from the network, or taken out of a sort-buffer or hash-table. It has the advantage that if some user-code group operation (groupReduce) materializes the group, it simply works (dedicated objects for each element).
- The reuse mode will have one or two objects that will be reused each time a record is received from the network, or taken out of a sort-buffer or hash-table. It behaves similarly to object reuse in Hadoop and saves on garbage collection, but requires the user code to sometimes be aware of object reuse.
Flink has multiple runtime backends that can execute a program. Some do not support both modes:
- Flink's own runtime. Works in both reuse and non-reuse mode, depending on what users select.
- Java Collections (non-parallel, for testing and lightweight embedding). Tolerates object reuse in user code but does not attempt to reuse by itself.
- Tez (coming up) will in the long run also support both modes (reuse and non-reuse).
Where the internal algorithms (sorting/hashing) do not affect any user code behavior, they always work in reusing mode. Greetings, Stephan On Sat, Feb 28, 2015 at 10:33 AM, Aljoscha Krettek aljos...@apache.org wrote: @Stephan, they are not copied when object reuse is enabled. This might be a problem, though, so maybe we should just change it in that place. On Sat, Feb 28, 2015 at 7:57 AM, Ted Dunning ted.dunn...@gmail.com wrote: This is going to have profound performance implications if this is the only path for iteration. On Fri, Feb 27, 2015 at 10:58 PM, Stephan Ewen se...@apache.org wrote: I vote to have the key extractor return a new value each time. That means that objects are not reused everywhere where it is possible, but still in most places, which still helps.
What still puzzles me: I thought that the collection execution stores copies of the returned records by default (reuse-safe mode). On 27.02.2015 15:36, Aljoscha Krettek aljos...@apache.org wrote: Hello Nation of Flink, while figuring out this bug: https://issues.apache.org/jira/browse/FLINK-1569 I came upon some difficulties. The problem is that the KeyExtractorMappers always return the same tuple. This is problematic, since Collection Execution does simply store the returned values in a list. These elements are not copied before they are stored when object reuse is enabled. Therefore, the whole list will contain only that one reused element. I see two options for solving this: 1. Change KeyExtractorMappers to always return a new tuple, thereby making object-reuse mode in cluster execution useless for key extractors. 2. Change the collection execution mapper to always make copies of the returned elements. This would make object reuse in collection execution pretty much obsolete, IMHO. How should we proceed with this? Cheers, Aljoscha
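The reuse hazard discussed in this thread can be reproduced in a few lines of plain Java: if a downstream collection stores the one reused object instead of a copy, every stored element ends up carrying the last value.

```java
import java.util.ArrayList;
import java.util.List;

public class ReusePitfall {
    // Demonstrates why collection execution must copy records when the
    // upstream operator reuses a single mutable object (the FLINK-1569 bug):
    // storing the reused object directly yields a list of identical elements.
    static List<int[]> collect(boolean copyBeforeStoring) {
        int[] reused = new int[1];           // one object reused for every record
        List<int[]> sink = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            reused[0] = i;                   // "emit" record i by mutating in place
            sink.add(copyBeforeStoring ? reused.clone() : reused);
        }
        return sink;
    }
}
```

With copying disabled, all three stored references point to the same array, so the list only ever reflects the final mutation.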
Re: [DISCUSS] Offer Flink with Scala 2.11
+1 for Scala 2.11

On Mon, Mar 2, 2015 at 5:02 PM, Alexander Alexandrov alexander.s.alexand...@gmail.com wrote: Spark currently only provides pre-builds for 2.10 and requires a custom build for 2.11. Not sure whether this is the best idea, but I can see the benefits from a project management point of view... Would you prefer to have {scala_version} × {hadoop_version} builds integrated on the website?

2015-03-02 16:57 GMT+01:00 Aljoscha Krettek aljos...@apache.org: +1 I also like it. We just have to figure out how we can publish two sets of release artifacts.

On Mon, Mar 2, 2015 at 4:48 PM, Stephan Ewen se...@apache.org wrote: Big +1 from my side! Does it have to be a Maven profile, or does a Maven property work? (A profile may be needed for the quasiquotes dependency?)

On Mon, Mar 2, 2015 at 4:36 PM, Alexander Alexandrov alexander.s.alexand...@gmail.com wrote: Hi there, since I'm relying on Scala 2.11.4 in a project I've been working on, I created a branch which updates the Scala version used by Flink from 2.10.4 to 2.11.4: https://github.com/stratosphere/flink/commits/scala_2.11 Everything seems to work fine and the PR contains minor changes compared to Spark: https://issues.apache.org/jira/browse/SPARK-4466 If you're interested, I can rewrite this as a Maven profile and open a PR so people can build Flink with 2.11 support. I suggest doing this sooner rather than later in order to: * keep the number of code changes enforced by the migration small and tractable; * discourage the use of deprecated or 2.11-incompatible source code in future commits; Regards, A.
[jira] [Created] (FLINK-1626) Spurious failure in MatchTask cancelling test
Stephan Ewen created FLINK-1626: --- Summary: Spurious failure in MatchTask cancelling test Key: FLINK-1626 URL: https://issues.apache.org/jira/browse/FLINK-1626 Project: Flink Issue Type: Bug Components: Local Runtime Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 This is a problem in the test. Cancelling a task may actually throw an exception (especially interrupted exceptions). The test can be modified to either only call cancel and not call interrupt, or to tolerate exceptions that are follow-up exceptions of interrupted exceptions. I would prepare a patch for the first approach. The stack trace of the symptom is below: {code} java.lang.RuntimeException: Hashtable closing was interrupted at org.apache.flink.runtime.operators.hash.MutableHashTable.close(MutableHashTable.java:652) at org.apache.flink.runtime.operators.hash.ReusingBuildFirstHashMatchIterator.close(ReusingBuildFirstHashMatchIterator.java:100) at org.apache.flink.runtime.operators.MatchDriver.cleanup(MatchDriver.java:179) at org.apache.flink.runtime.operators.testutils.DriverTestBase.testDriverInternal(DriverTestBase.java:245) at org.apache.flink.runtime.operators.testutils.DriverTestBase.testDriver(DriverTestBase.java:175) at org.apache.flink.runtime.operators.MatchTaskTest$4.run(MatchTaskTest.java:783) Tests run: 47, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 15.589 sec FAILURE! - in org.apache.flink.runtime.operators.MatchTaskTest testCancelHashMatchTaskWhileBuildFirst[1](org.apache.flink.runtime.operators.MatchTaskTest) Time elapsed: 1.029 sec FAILURE! java.lang.AssertionError: Test threw an exception even though it was properly canceled. at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.flink.runtime.operators.MatchTaskTest.testCancelHashMatchTaskWhileBuildFirst(MatchTaskTest.java:802) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
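The second fix the issue mentions, tolerating follow-up exceptions of an interrupt, could be sketched as a helper that walks the cause chain. This is a hypothetical test utility, not Flink's actual test code:

```java
public class InterruptTolerance {
    // Returns true if the throwable is, or was caused by, an
    // InterruptedException, so a cancellation test can tolerate it
    // instead of failing the assertion.
    static boolean isCausedByInterrupt(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof InterruptedException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Mirrors the symptom above: a RuntimeException wrapping an interrupt.
        RuntimeException followUp = new RuntimeException(
                "Hashtable closing was interrupted", new InterruptedException());
        System.out.println(isCausedByInterrupt(followUp));                // true
        System.out.println(isCausedByInterrupt(new RuntimeException())); // false
    }
}
```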
[jira] [Created] (FLINK-1627) Netty channel connect deadlock
Ufuk Celebi created FLINK-1627: -- Summary: Netty channel connect deadlock Key: FLINK-1627 URL: https://issues.apache.org/jira/browse/FLINK-1627 Project: Flink Issue Type: Bug Reporter: Ufuk Celebi [~StephanEwen] reports the following deadlock (https://travis-ci.org/StephanEwen/incubator-flink/jobs/52755844, logs: https://s3.amazonaws.com/flink.a.o.uce.east/travis-artifacts/StephanEwen/incubator-flink/477/477.2.tar.gz). {code} CHAIN Partition - Map (Map at testRestartMultipleTimes(SimpleRecoveryITCase.java:200)) (2/4) daemon prio=10 tid=0x7f5fdc008800 nid=0xe230 in Object.wait() [0x7f5fca8f2000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xf2a13530 (a java.lang.Object) at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:179) - locked 0xf2a13530 (a java.lang.Object) at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:125) at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:64) at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:53) at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestIntermediateResultPartition(RemoteInputChannel.java:92) at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:287) - locked 0xf29dbcd8 (a java.lang.Object) at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:306) at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:75) at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:34) at 
org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59) at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:91) at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496) at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362) at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:205) at java.lang.Thread.run(Thread.java:745) {code} {code} CHAIN Partition - Map (Map at testRestartMultipleTimes(SimpleRecoveryITCase.java:200)) (3/4) daemon prio=10 tid=0x7f5fdc005000 nid=0xe22f in Object.wait() [0x7f5fca9f3000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xf2a13530 (a java.lang.Object) at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:179) - locked 0xf2a13530 (a java.lang.Object) at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:125) at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:79) at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:53) at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestIntermediateResultPartition(RemoteInputChannel.java:92) at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:287) - locked 0xf2896f88 (a java.lang.Object) at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:306) at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:75) at 
org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:34) at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59) at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:91) at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496) at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362) at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:205) at
[jira] [Created] (FLINK-1628) Strange behavior of where function during a join
Daniel Bali created FLINK-1628: -- Summary: Strange behavior of where function during a join Key: FLINK-1628 URL: https://issues.apache.org/jira/browse/FLINK-1628 Project: Flink Issue Type: Bug Affects Versions: 0.9 Reporter: Daniel Bali Hello! If I use the `where` function with a field list during a join, it exhibits strange behavior. Here is the sample code that triggers the error: https://gist.github.com/balidani/d9789b713e559d867d5c This example joins a DataSet with itself, then counts the number of rows. If I use `.where(0, 1)` the result is (22), which is not correct. If I use `EdgeKeySelector`, I get the correct result (101). When I pass a field list to the `equalTo` function (but not `where`), everything works again. If I don't include the `groupBy` and `reduceGroup` parts, everything works. Also, when working with large DataSets, passing a field list to `where` makes it incredibly slow, even though I don't see any exceptions in the log (in DEBUG mode). Does anybody know what might cause this problem? Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
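For reference, joining on a field list like `.where(0, 1)` is expected to be semantically equivalent to matching on a composite key built from those fields, as the `EdgeKeySelector` apparently does. A plain-Java model of that expected behavior (illustrative only, not Flink code; `EdgeKey` and `selfJoinCount` are hypothetical names):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class CompositeKeyDemo {
    // Composite key over two edge fields, as .where(0, 1) should behave.
    static final class EdgeKey {
        final long src, dst;
        EdgeKey(long src, long dst) { this.src = src; this.dst = dst; }
        @Override public boolean equals(Object o) {
            return o instanceof EdgeKey
                    && ((EdgeKey) o).src == src && ((EdgeKey) o).dst == dst;
        }
        @Override public int hashCode() { return Objects.hash(src, dst); }
    }

    // A self-join on (src, dst) must pair every left occurrence of a key
    // with every right occurrence of the same key.
    static int selfJoinCount(long[][] edges) {
        Map<EdgeKey, Integer> counts = new HashMap<>();
        for (long[] e : edges) {
            counts.merge(new EdgeKey(e[0], e[1]), 1, Integer::sum);
        }
        int pairs = 0;
        for (int c : counts.values()) {
            pairs += c * c; // each left occurrence joins each right occurrence
        }
        return pairs;
    }

    public static void main(String[] args) {
        long[][] edges = {{1, 2}, {2, 3}, {1, 2}};
        System.out.println(selfJoinCount(edges)); // 5: 2*2 for (1,2), 1*1 for (2,3)
    }
}
```

Whichever key definition is used, the self-join count is fully determined by the key multiplicities, which is why the field-list and key-selector variants disagreeing (22 vs. 101) points to a bug rather than ambiguity in the API.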
[jira] [Created] (FLINK-1629) Add option to start Flink on YARN in a detached mode
Robert Metzger created FLINK-1629: - Summary: Add option to start Flink on YARN in a detached mode Key: FLINK-1629 URL: https://issues.apache.org/jira/browse/FLINK-1629 Project: Flink Issue Type: Improvement Components: YARN Client Reporter: Robert Metzger Assignee: Robert Metzger Right now, we expect the YARN command line interface to stay connected to the Application Master all the time to control the YARN session or the job. For very long-running sessions or jobs, users want to just fire and forget a job/session to YARN. Stopping the session will still be possible using YARN's tools. Also, prior to detaching itself, the CLI frontend could print the required command to kill the session as a convenience. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1630) Add option to YARN client to re-allocate failed containers
Robert Metzger created FLINK-1630: - Summary: Add option to YARN client to re-allocate failed containers Key: FLINK-1630 URL: https://issues.apache.org/jira/browse/FLINK-1630 Project: Flink Issue Type: Improvement Components: YARN Client Affects Versions: 0.9 Reporter: Robert Metzger Assignee: Robert Metzger The current Flink YARN client tries to allocate only the initial number of containers. If containers fail (in particular during long-running sessions), there is no way of re-allocating them. We should add an option to the ApplicationMaster to re-allocate missing/failed containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [DISCUSS] Offer Flink with Scala 2.11
I'm +1 if this doesn't affect existing Scala 2.10 users. I would also suggest adding a Scala 2.11 build to Travis to ensure everything is working with the different Hadoop/JVM versions. It shouldn't be a big deal to offer scala_version x hadoop_version builds for newer releases. You only need to add more builds here: https://github.com/apache/flink/blob/master/tools/create_release_files.sh#L131

On Mon, Mar 2, 2015 at 6:17 PM, Till Rohrmann trohrm...@apache.org wrote: +1 for Scala 2.11

On Mon, Mar 2, 2015 at 5:02 PM, Alexander Alexandrov alexander.s.alexand...@gmail.com wrote: Spark currently only provides pre-builds for 2.10 and requires a custom build for 2.11. Not sure whether this is the best idea, but I can see the benefits from a project management point of view... Would you prefer to have {scala_version} × {hadoop_version} builds integrated on the website?

2015-03-02 16:57 GMT+01:00 Aljoscha Krettek aljos...@apache.org: +1 I also like it. We just have to figure out how we can publish two sets of release artifacts.

On Mon, Mar 2, 2015 at 4:48 PM, Stephan Ewen se...@apache.org wrote: Big +1 from my side! Does it have to be a Maven profile, or does a Maven property work? (A profile may be needed for the quasiquotes dependency?)

On Mon, Mar 2, 2015 at 4:36 PM, Alexander Alexandrov alexander.s.alexand...@gmail.com wrote: Hi there, since I'm relying on Scala 2.11.4 in a project I've been working on, I created a branch which updates the Scala version used by Flink from 2.10.4 to 2.11.4: https://github.com/stratosphere/flink/commits/scala_2.11 Everything seems to work fine and the PR contains minor changes compared to Spark: https://issues.apache.org/jira/browse/SPARK-4466 If you're interested, I can rewrite this as a Maven profile and open a PR so people can build Flink with 2.11 support.
I suggest doing this sooner rather than later in order to: * keep the number of code changes enforced by the migration small and tractable; * discourage the use of deprecated or 2.11-incompatible source code in future commits; Regards, A.
February 2015 in the Flink community
Hi everyone, February might be the shortest month of the year, but the community has been pretty busy: - Flink 0.8.1, a bugfix release, has been made available - The project added a new committer - Flink contributors developed a Flink adapter for Apache SAMOA - Flink committers contributed to Google's bdutil; starting from release 1.2, users of bdutil can deploy Flink clusters on Google Cloud Platform - Flink was mentioned in several articles online, and many large features have been merged to the master repository. You can read the full blog post here: http://flink.apache.org/news/2015/03/02/february-2015-in-flink.html
[jira] [Created] (FLINK-1631) Port collisions in ProcessReaping tests
Stephan Ewen created FLINK-1631: --- Summary: Port collisions in ProcessReaping tests Key: FLINK-1631 URL: https://issues.apache.org/jira/browse/FLINK-1631 Project: Flink Issue Type: Bug Components: Distributed Runtime Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The process reaping tests for the JobManager spawn a process that starts a webserver on the default port. It may happen that this port is not available, due to another concurrently running task. I suggest adding an option to not start the webserver, by setting the webserver port to {{-1}}, to prevent this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
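Besides disabling the webserver via port {{-1}}, a common alternative for avoiding port collisions in tests is to ask the OS for an ephemeral port first and hand that port to the component under test. A generic sketch of that technique, not the actual Flink test code:

```java
import java.io.IOException;
import java.net.ServerSocket;

public class FreePortFinder {
    // Binds to port 0 so the OS assigns an unused port, then releases it.
    // There is a small race window before the caller rebinds the port,
    // but collisions between concurrently running suites become unlikely.
    static int findFreePort() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        int port = findFreePort();
        // A valid, currently unused TCP port in the ephemeral range.
        System.out.println(port > 0 && port <= 65535);
    }
}
```

The disable-the-webserver option proposed in the issue is simpler and race-free, which is likely why it was preferred here.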