[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/571#issuecomment-96828941 This PR was split into PR #632 and PR #633 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/571 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user fhueske commented on a diff in the pull request: https://github.com/apache/flink/pull/571#discussion_r29129684 --- Diff: flink-staging/flink-hbase/pom.xml --- @@ -112,6 +112,12 @@ under the License. /exclusion /exclusions /dependency + dependency --- End diff -- I do not have a HBase setup here. Could you try to exclude all dependencies of hbase-server and add them until it works? I hope the TableInputFormat and TableOutputFormat have not too many external dependencies. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user fpompermaier commented on a diff in the pull request: https://github.com/apache/flink/pull/571#discussion_r29130488 --- Diff: flink-staging/flink-hbase/pom.xml --- @@ -112,6 +112,12 @@ under the License. /exclusion /exclusions /dependency + dependency --- End diff -- Ok, I hope to be able to do it before this evening! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user fpompermaier commented on a diff in the pull request: https://github.com/apache/flink/pull/571#discussion_r28966088 --- Diff: flink-staging/flink-hbase/pom.xml --- @@ -112,6 +112,12 @@ under the License. /exclusion /exclusions /dependency + dependency --- End diff -- Could you do that? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user StephanEwen commented on a diff in the pull request: https://github.com/apache/flink/pull/571#discussion_r28961417 --- Diff: flink-staging/flink-hbase/pom.xml --- @@ -112,6 +112,12 @@ under the License. /exclusion /exclusions /dependency + dependency --- End diff -- Fair enough. Then the dependency should not be in test scope, but in the default scope, so users get this dependency into their fat jar as well when using the HBase output format. May be worth to define a few exclusions, though, to not get the complete tail of transitive HBase dependencies (I think that even includes JRuby and so on) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user fpompermaier commented on a diff in the pull request: https://github.com/apache/flink/pull/571#discussion_r28869610 --- Diff: flink-staging/flink-hbase/pom.xml --- @@ -112,6 +112,12 @@ under the License. /exclusion /exclusions /dependency + dependency --- End diff -- Unfortunately the TableInputFormat and TableOutputFormat are in the server jar. For the read we've reimplemented it to make it more robust so we don't need that jar, but for the output it is indeed required. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user StephanEwen commented on a diff in the pull request: https://github.com/apache/flink/pull/571#discussion_r28805740 --- Diff: flink-staging/flink-hbase/pom.xml --- @@ -112,6 +112,12 @@ under the License. /exclusion /exclusions /dependency + dependency --- End diff -- Is the HBase server dependency really required for any client that wants to write into HBase? This seems like a pretty bad design on the HBase side. Can you tell us what fails when you omit this dependency? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user fpompermaier commented on a diff in the pull request: https://github.com/apache/flink/pull/571#discussion_r28765996 --- Diff: flink-staging/flink-hbase/pom.xml --- @@ -112,6 +112,12 @@ under the License. /exclusion /exclusions /dependency + dependency --- End diff -- It is needed if you want to use the HBase TableOutputFormat --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user fhueske commented on a diff in the pull request: https://github.com/apache/flink/pull/571#discussion_r28765691 --- Diff: flink-staging/flink-hbase/src/test/resources/hbase-site.xml --- @@ -22,14 +22,13 @@ -- configuration +!-- --- End diff -- Are these mandatory parameters to use HBase? Otherwise, we should remove them. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user fhueske commented on a diff in the pull request: https://github.com/apache/flink/pull/571#discussion_r28765745 --- Diff: flink-staging/flink-hbase/pom.xml --- @@ -112,6 +112,12 @@ under the License. /exclusion /exclusions /dependency + dependency --- End diff -- Why did you add this dependency? There are no additional tests that would require it, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user fhueske commented on a diff in the pull request: https://github.com/apache/flink/pull/571#discussion_r28766220 --- Diff: flink-staging/flink-hbase/pom.xml --- @@ -112,6 +112,12 @@ under the License. /exclusion /exclusions /dependency + dependency --- End diff -- But why is it in test scope then? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/571#issuecomment-94735401 Looks good, except two minor things. Once these are resolved I would merge it and also backport it to the 0.8 branch. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user fhueske commented on a diff in the pull request: https://github.com/apache/flink/pull/571#discussion_r28766303 --- Diff: flink-staging/flink-hbase/src/test/resources/hbase-site.xml --- @@ -22,14 +22,13 @@ -- configuration +!-- --- End diff -- OK, let's remove them. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user fpompermaier commented on a diff in the pull request: https://github.com/apache/flink/pull/571#discussion_r28766073 --- Diff: flink-staging/flink-hbase/src/test/resources/hbase-site.xml --- @@ -22,14 +22,13 @@ -- configuration +!-- --- End diff -- I think you can remove the hbase-site.xml file. It is required only if you have hbase settings different from the default ones. Also the log4j.properties could be removed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user fhueske commented on a diff in the pull request: https://github.com/apache/flink/pull/571#discussion_r28768490 --- Diff: flink-staging/flink-hbase/pom.xml --- @@ -112,6 +112,12 @@ under the License. /exclusion /exclusions /dependency + dependency --- End diff -- But putting it into test scope is not a proper solution to solve the issue. IMO, it should be either put in the regular scope such that it can be used at runtime or we put it into a comment and add a line explaining why we did that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user fpompermaier commented on the pull request: https://github.com/apache/flink/pull/571#issuecomment-90481118 Ok, I created this issue (https://issues.apache.org/jira/browse/FLINK-1834) about the mapred.output.dir --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/571#issuecomment-90475934 Is the Hadoop configuration specified in the flink-conf.yaml loaded? If we set `mapred.output.dir` then we should check for an existing config entry beforehand. Otherwise, we overwrite Hadoop configuration values. Like @fhueske suggested, please open a JIRA for investigation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
GitHub user fpompermaier opened a pull request: https://github.com/apache/flink/pull/571 Fixed Configurable HadoopOutputFormat (FLINK-1828) See https://issues.apache.org/jira/browse/FLINK-1828 You can merge this pull request into a Git repository by running: $ git pull https://github.com/fpompermaier/flink master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/571.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #571 commit 83655bf2773a871c0fb88481be51c6d61ee98881 Author: fpompermaier f.pomperma...@gmail.com Date: 2015-04-04T10:57:36Z Fixed Configurable Hadoop output format initialization, added a simple HBase sink test and upgraded HBase dependencies (from 0.98.6 to 0.98.11) commit 85dbacf46c6f97f6033a4247cdd60ded87b93641 Author: fpompermaier f.pomperma...@gmail.com Date: 2015-04-04T10:57:36Z Fixed Configurable Hadoop output format initialization, added a simple HBase sink test and upgraded HBase dependencies (from 0.98.6 to 0.98.11) commit da39bd2da2ab6ae03ff90b4434e167b8278d2df2 Author: fpompermaier f.pomperma...@gmail.com Date: 2015-04-04T11:11:55Z Merge branch 'master' of https://github.com/fpompermaier/flink.git Conflicts: flink-java/src/main/java/org/apache/flink/api/java/hadoop/mapreduce/HadoopOutputFormatBase.java --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user fhueske commented on a diff in the pull request: https://github.com/apache/flink/pull/571#discussion_r27768581 --- Diff: flink-staging/flink-hbase/src/test/java/org/apache/flink/addons/hbase/example/HBaseWriteExample.java --- @@ -0,0 +1,197 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.addons.hbase.example; + +import org.apache.flink.api.common.functions.FlatMapFunction; +import org.apache.flink.api.common.functions.RichMapFunction; +import org.apache.flink.api.java.DataSet; +import org.apache.flink.api.java.ExecutionEnvironment; +import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat; +import org.apache.flink.api.java.tuple.Tuple2; +import org.apache.flink.configuration.Configuration; +import org.apache.flink.util.Collector; +import org.apache.hadoop.hbase.client.Mutation; +import org.apache.hadoop.hbase.client.Put; +import org.apache.hadoop.hbase.mapreduce.TableOutputFormat; +import org.apache.hadoop.hbase.util.Bytes; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.Job; + +@SuppressWarnings(serial) +public class HBaseWriteExample { + + // * + // PROGRAM + // * + + public static void main(String[] args) throws Exception { + + if(!parseParameters(args)) { + return; + } + + // set up the execution environment + final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); + + // get input data + DataSetString text = getTextDataSet(env); + + DataSetTuple2String, Integer counts = + // split up the lines in pairs (2-tuples) containing: (word,1) + text.flatMap(new Tokenizer()) + // group by the tuple field 0 and sum up tuple field 1 + .groupBy(0) + .sum(1); + + // emit result +// if(fileOutput) { --- End diff -- the `if` statement should be completely removed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user fhueske commented on a diff in the pull request: https://github.com/apache/flink/pull/571#discussion_r27768585 --- Diff: flink-staging/flink-hbase/src/test/java/org/apache/flink/addons/hbase/example/HBaseWriteExample.java --- @@ -0,0 +1,197 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.addons.hbase.example; + +import org.apache.flink.api.common.functions.FlatMapFunction; +import org.apache.flink.api.common.functions.RichMapFunction; +import org.apache.flink.api.java.DataSet; +import org.apache.flink.api.java.ExecutionEnvironment; +import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat; +import org.apache.flink.api.java.tuple.Tuple2; +import org.apache.flink.configuration.Configuration; +import org.apache.flink.util.Collector; +import org.apache.hadoop.hbase.client.Mutation; +import org.apache.hadoop.hbase.client.Put; +import org.apache.hadoop.hbase.mapreduce.TableOutputFormat; +import org.apache.hadoop.hbase.util.Bytes; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.Job; + +@SuppressWarnings(serial) +public class HBaseWriteExample { + + // * + // PROGRAM + // * + + public static void main(String[] args) throws Exception { + + if(!parseParameters(args)) { + return; + } + + // set up the execution environment + final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); + + // get input data + DataSetString text = getTextDataSet(env); + + DataSetTuple2String, Integer counts = + // split up the lines in pairs (2-tuples) containing: (word,1) + text.flatMap(new Tokenizer()) + // group by the tuple field 0 and sum up tuple field 1 + .groupBy(0) + .sum(1); + + // emit result +// if(fileOutput) { + Job job = Job.getInstance(); + job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, outputTableName); + // TODO is mapred.output.dir really useful? + job.getConfiguration().set(mapred.output.dir,/tmp/test); + counts.map(new RichMapFunction Tuple2String,Integer, Tuple2Text,Mutation() { + private final byte[] CF_SOME = Bytes.toBytes(test-column); + private final byte[] Q_SOME = Bytes.toBytes(value); + private transient Tuple2Text, Mutation reuse; + + @Override + public void open(Configuration parameters) throws Exception { + super.open(parameters); + reuse = new Tuple2Text, Mutation(); + } + + @Override + public Tuple2Text, Mutation map(Tuple2String, Integer t) throws Exception { + reuse.f0 = new Text(t.f0); + Put put = new Put(t.f0.getBytes()); + put.add(CF_SOME, Q_SOME, Bytes.toBytes(t.f1)); + reuse.f1 = put; + return reuse; + } + }).output(new HadoopOutputFormatText, Mutation(new TableOutputFormatText(), job)); +// } else { --- End diff -- `else` branch not necessary --- If your project is set up for it, you can reply to this email and have your reply appear on
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/571#issuecomment-89559472 Looks good to me, except for some comments that should be removed. Regarding the `mapred.output.dir` parameter I am not sure whether this is generally expected for all Hadoop OutputFormats or only required for file-based OutputFormats. I would keep it for now and open a JIRA to investigate the issue and fix it if necessary. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Fixed Configurable HadoopOutputFormat (FLINK-1...
Github user fpompermaier commented on the pull request: https://github.com/apache/flink/pull/571#issuecomment-89570344 Removed comments and commented code as suggested by Fabian. Do I have also to create a JIRA ticket about mapred.output.dir parameter? I think that it can be defaulted to the Flink temp directory or flinkTempDir/hadoop/job-id --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---