[jira] [Commented] (FLINK-1219) Add support for Apache Tez as execution engine
[ https://issues.apache.org/jira/browse/FLINK-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392725#comment-14392725 ] ASF GitHub Bot commented on FLINK-1219: --- Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/189#issuecomment-88910737 Fix for the build failure is being tested: https://travis-ci.org/rmetzger/flink/builds/56885481 Add support for Apache Tez as execution engine -- Key: FLINK-1219 URL: https://issues.apache.org/jira/browse/FLINK-1219 Project: Flink Issue Type: New Feature Reporter: Kostas Tzoumas Assignee: Kostas Tzoumas This is an umbrella issue to track Apache Tez support. The goal is to be able to run unmodified Flink programs as Apache Tez jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1219) Add support for Apache Tez as execution engine
[ https://issues.apache.org/jira/browse/FLINK-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393060#comment-14393060 ] ASF GitHub Bot commented on FLINK-1219: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/189 Add support for Apache Tez as execution engine -- Key: FLINK-1219 URL: https://issues.apache.org/jira/browse/FLINK-1219 Project: Flink Issue Type: New Feature Reporter: Kostas Tzoumas Assignee: Kostas Tzoumas This is an umbrella issue to track Apache Tez support. The goal is to be able to run unmodified Flink programs as Apache Tez jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1219) Add support for Apache Tez as execution engine
[ https://issues.apache.org/jira/browse/FLINK-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354597#comment-14354597 ] ASF GitHub Bot commented on FLINK-1219: --- Github user rmetzger commented on a diff in the pull request: https://github.com/apache/flink/pull/189#discussion_r26108160 --- Diff: docs/flink_on_tez_guide.md --- @@ -0,0 +1,293 @@ +--- +title: Running Flink on YARN leveraging Tez +--- +!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +License); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +-- + +* This will be replaced by the TOC +{:toc} + + +a href=#top/a + +## Introduction + +You can run Flink using Tez as an execution environment. Flink on Tez +is currently included in *flink-staging* in alpha. All classes are +localted in the *org.apache.flink.tez* package. + +## Why Flink on Tez + +[Apache Tez](tez.apache.org) is a scalable data processing +platform. Tez provides an API for specifying a directed acyclic +graph (DAG), and functionality for placing the DAG vertices in YARN +containers, as well as data shuffling. In Flink's architecture, +Tez is at about the same level as Flink's network stack. While Flink's +network stack focuses heavily on low latency in order to support +pipelining, data streaming, and iterative algorithms, Tez +focuses on scalability and elastic resource usage. + +Thus, by replacing Flink's network stack with Tez, users can get scalability +and elastic resource usage in shared clusters while retaining Flink's +APIs, optimizer, and runtime algorithms (local sorts, hash tables, etc). + +Flink programs can run almost unmodified using Tez as an execution +environment. Tez supports local execution (e.g., for debugging), and +remote execution on YARN. + + +## Local execution + +The `LocalTezEnvironment` can be used run programs using the local +mode provided by Tez. This is for example WordCount using Tez local mode. +It is identical to a normal Flink WordCount, except that the `LocalTezEnvironment` is used. +To run in local Tez mode, you can simply run a Flink on Tez program +from your IDE (e.g., right click and run). + +{% highlight java %} +public class WordCountExample { +public static void main(String[] args) throws Exception { +final LocalTezEnvironment env = LocalTezEnvironment.create(); + + DataSetString text = env.fromElements( +Who's there?, +I think I hear them. Stand, ho! Who's there?); + +DataSetTuple2String, Integer wordCounts = text +.flatMap(new LineSplitter()) +.groupBy(0) +.sum(1); + +wordCounts.print(); + +env.execute(Word Count Example); +} + +public static class LineSplitter implements FlatMapFunctionString, Tuple2String, Integer { +@Override +public void flatMap(String line, CollectorTuple2String, Integer out) { +for (String word : line.split( )) { +out.collect(new Tuple2String, Integer(word, 1)); +} +} +} +} +{% endhighlight %} + +## YARN execution + +### Setup + +- Install Tez on your Hadoop 2 cluster following the instructions from the + [Apache Tez website](http://tez.apache.org/install.html). If you are able to run + the examples that ship with Tez, then Tez has been successfully installed. + +- Currently, you need to build Flink yourself to obtain Flink on Tez + (the reason is a Hadoop version compatibility: Tez releases artifacts + on Maven central with a Hadoop 2.6.0 dependency). Build Flink + using `mvn -DskipTests clean package -Dhadoop.version=X.X.X -Pinclude-tez`. + Make sure that the Hadoop version matches the version that Tez uses. + Obtain the jar file contained in the Flink distribution under + `flink-staging/flink-tez/target/flink-tez-x.y.z-flink-fat-jar.jar`
[jira] [Commented] (FLINK-1219) Add support for Apache Tez as execution engine
[ https://issues.apache.org/jira/browse/FLINK-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354624#comment-14354624 ] ASF GitHub Bot commented on FLINK-1219: --- Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/189#issuecomment-78025522 The documentation is very well written. I've tried the pull request on a Hortonworks Sandbox with HDP 2.2. They have Tez 0.5.2.2.2.0.0 installed there. The mvn command to build Tez for this environment is: ``` mvn install -DskipTests -Pinclude-tez -Dhadoop.version=2.6.0 -Dtez.version=0.5.2.2.2.0.0-2041 -Pvendor-repos ``` To change the tez version, I made the following changes to the pom: ```diff diff --git a/flink-staging/flink-tez/pom.xml b/flink-staging/flink-tez/pom.xml index e84f731..147dd32 100644 --- a/flink-staging/flink-tez/pom.xml +++ b/flink-staging/flink-tez/pom.xml @@ -34,6 +34,9 @@ under the License. nameflink-tez/name packagingjar/packaging +properties + tez.version0.6.0/tez.version +/properties dependencies dependency @@ -81,25 +84,25 @@ under the License. dependency groupIdorg.apache.tez/groupId artifactIdtez-api/artifactId -version0.6.0/version +version${tez.version}/version /dependency dependency groupIdorg.apache.tez/groupId artifactIdtez-common/artifactId -version0.6.0/version +version${tez.version}/version /dependency dependency groupIdorg.apache.tez/groupId artifactIdtez-dag/artifactId -version0.6.0/version +version${tez.version}/version /dependency dependency groupIdorg.apache.tez/groupId artifactIdtez-runtime-library/artifactId -version0.6.0/version +version${tez.version}/version /dependency dependency ``` Once these issues have been resolved, I'd say the change is +1 to merge. Add support for Apache Tez as execution engine -- Key: FLINK-1219 URL: https://issues.apache.org/jira/browse/FLINK-1219 Project: Flink Issue Type: New Feature Reporter: Kostas Tzoumas Assignee: Kostas Tzoumas This is an umbrella issue to track Apache Tez support. The goal is to be able to run unmodified Flink programs as Apache Tez jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1219) Add support for Apache Tez as execution engine
[ https://issues.apache.org/jira/browse/FLINK-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14340225#comment-14340225 ] ASF GitHub Bot commented on FLINK-1219: --- Github user ktzoumas commented on the pull request: https://github.com/apache/flink/pull/189#issuecomment-76407035 I tested this on a 25-node cluster with WordCount and the modified TPC-H query on generated TPC-H data of scale 1000. I also added docs. Add support for Apache Tez as execution engine -- Key: FLINK-1219 URL: https://issues.apache.org/jira/browse/FLINK-1219 Project: Flink Issue Type: New Feature Reporter: Kostas Tzoumas Assignee: Kostas Tzoumas This is an umbrella issue to track Apache Tez support. The goal is to be able to run unmodified Flink programs as Apache Tez jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1219) Add support for Apache Tez as execution engine
[ https://issues.apache.org/jira/browse/FLINK-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14340266#comment-14340266 ] ASF GitHub Bot commented on FLINK-1219: --- Github user rmetzger commented on a diff in the pull request: https://github.com/apache/flink/pull/189#discussion_r25512682 --- Diff: flink-staging/flink-tez/src/main/java/org/apache/flink/tez/client/RemoteTezEnvironment.java --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.tez.client; + +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.flink.api.common.JobExecutionResult; +import org.apache.flink.api.common.Plan; +import org.apache.flink.api.java.ExecutionEnvironment; +import org.apache.flink.compiler.DataStatistics; +import org.apache.flink.compiler.PactCompiler; +import org.apache.flink.compiler.costs.DefaultCostEstimator; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.util.ClassUtil; +import org.apache.hadoop.util.ToolRunner; + + +public class RemoteTezEnvironment extends ExecutionEnvironment { + + private static final Log LOG = LogFactory.getLog(TezExecutor.class); + + private PactCompiler compiler; + private TezExecutor executor; + private Path jarPath = null; + + + public void registerMainClass (Class mainClass) { + jarPath = new Path(ClassUtil.findContainingJar(mainClass)); + //this.jarURL = mainClass.getProtectionDomain().getCodeSource().getLocation(); --- End diff -- this looks like debugging code Add support for Apache Tez as execution engine -- Key: FLINK-1219 URL: https://issues.apache.org/jira/browse/FLINK-1219 Project: Flink Issue Type: New Feature Reporter: Kostas Tzoumas Assignee: Kostas Tzoumas This is an umbrella issue to track Apache Tez support. The goal is to be able to run unmodified Flink programs as Apache Tez jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1219) Add support for Apache Tez as execution engine
[ https://issues.apache.org/jira/browse/FLINK-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14340267#comment-14340267 ] ASF GitHub Bot commented on FLINK-1219: --- Github user rmetzger commented on a diff in the pull request: https://github.com/apache/flink/pull/189#discussion_r25512719 --- Diff: flink-staging/flink-tez/src/main/java/org/apache/flink/tez/client/RemoteTezEnvironment.java --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.tez.client; + +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.flink.api.common.JobExecutionResult; +import org.apache.flink.api.common.Plan; +import org.apache.flink.api.java.ExecutionEnvironment; +import org.apache.flink.compiler.DataStatistics; +import org.apache.flink.compiler.PactCompiler; +import org.apache.flink.compiler.costs.DefaultCostEstimator; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.util.ClassUtil; +import org.apache.hadoop.util.ToolRunner; + + +public class RemoteTezEnvironment extends ExecutionEnvironment { + + private static final Log LOG = LogFactory.getLog(TezExecutor.class); --- End diff -- The log messages will appear from the wrong class. Add support for Apache Tez as execution engine -- Key: FLINK-1219 URL: https://issues.apache.org/jira/browse/FLINK-1219 Project: Flink Issue Type: New Feature Reporter: Kostas Tzoumas Assignee: Kostas Tzoumas This is an umbrella issue to track Apache Tez support. The goal is to be able to run unmodified Flink programs as Apache Tez jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1219) Add support for Apache Tez as execution engine
[ https://issues.apache.org/jira/browse/FLINK-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311019#comment-14311019 ] Fabian Hueske commented on FLINK-1219: -- Is this a duplicate of FLINK-972? Add support for Apache Tez as execution engine -- Key: FLINK-1219 URL: https://issues.apache.org/jira/browse/FLINK-1219 Project: Flink Issue Type: New Feature Reporter: Kostas Tzoumas Assignee: Kostas Tzoumas This is an umbrella issue to track Apache Tez support. The goal is to be able to run unmodified Flink programs as Apache Tez jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)