Re: Integrate Flink with S3 on EMR cluster

2017-03-10 Thread Robert Metzger
Easiest way I found is to run the "hadoop classpath" command and paste its value into the HADOOP_CLASSPATH environment variable. This way we don't have to copy any S3-specific jars to Flink's lib folder.
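
In shell form, the approach described above is just the following (a minimal sketch; the actual classpath value depends on the EMR image):

    # Ask Hadoop for its full classpath and expose it to Flink, so the
    # EMRFS/S3 filesystem classes are found at runtime without copying
    # any jars into Flink's lib folder.
    export HADOOP_CLASSPATH=$(hadoop classpath)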

Re: Integrate Flink with S3 on EMR cluster

2017-03-08 Thread vinay patil

Re: Integrate Flink with S3 on EMR cluster

2017-03-07 Thread Shannon Carey
Generally, using the S3 filesystem in EMR with Flink has worked pretty well for me in Flink < 1.2 (unless you run out of connections in your HTTP pool). When you say "using Hadoop File System class", what do you mean? In my experience, it's sufficient to just use the "s3://" filesystem protocol
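
For illustration, a minimal Scala sketch of what "just using the s3:// protocol" looks like in a DataSet job; the bucket and paths are hypothetical:

    import org.apache.flink.api.scala._

    object S3WordCount {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment
        // With the Hadoop/EMRFS classes on the classpath, s3:// paths
        // behave like any other filesystem path.
        val text = env.readTextFile("s3://my-bucket/input/data.txt")
        val counts = text
          .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
          .map((_, 1))
          .groupBy(0)
          .sum(1)
        counts.writeAsCsv("s3://my-bucket/output/")
        env.execute("S3 WordCount")
      }
    }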

Re: Integrate Flink with S3 on EMR cluster

2017-03-07 Thread vinay patil

Re: Integrate Flink with S3 on EMR cluster

2017-03-06 Thread vinay patil
these libs are already included in the Hadoop classpath. Is there any other way I can make this work?

Re: Integrate Flink with S3 on EMR cluster

2016-04-10 Thread Stephan Ewen
You can always explicitly request a broadcast join via "joinWithTiny", "joinWithHuge", or by supplying a JoinHint. Greetings, Stephan. On Sat, Apr 9, 2016 at 1:56 AM, Timur Fayruzov wrote: > Thank you Robert. One of my test cases is a broadcast join, so I need to
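
A minimal Scala sketch of the three options mentioned above, using hypothetical toy datasets:

    import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint
    import org.apache.flink.api.scala._

    val env  = ExecutionEnvironment.getExecutionEnvironment
    val big  = env.fromElements(("a", 1), ("b", 2)) // stands in for a large input
    val tiny = env.fromElements(("a", "x"))         // stands in for a small input

    // 1) Broadcast the right-hand (small) side to all instances of the left:
    val j1 = big.joinWithTiny(tiny).where(0).equalTo(0)
    // 2) Conversely, mark the right-hand side as the huge one:
    val j2 = tiny.joinWithHuge(big).where(0).equalTo(0)
    // 3) Or state the strategy explicitly via a JoinHint:
    val j3 = big.join(tiny, JoinHint.BROADCAST_HASH_SECOND).where(0).equalTo(0)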

Re: Integrate Flink with S3 on EMR cluster

2016-04-08 Thread Robert Metzger
Hi Timur, the Flink optimizer runs on the client, so the exception is thrown from the JVM running the ./bin/flink client. Since the statistics sampling is an optional step, it's surrounded by a try/catch block that just logs the error message. More answers inline below. On Thu, Apr 7, 2016 at

Re: Integrate Flink with S3 on EMR cluster

2016-04-07 Thread Timur Fayruzov
The exception does not show up in the console when I run the job; it only shows up in the logs. I took that to mean it happens either on the AM or a TM (I assume what I see in stdout is the client log). Is my thinking right? On Thu, Apr 7, 2016 at 12:29 PM, Ufuk Celebi wrote: > Hey

Re: Integrate Flink with S3 on EMR cluster

2016-04-07 Thread Ufuk Celebi
Hey Timur, just had a chat with Robert about this. I agree that the error message is confusing, but it is fine in this case. The file system classes are not on the class path of the client process, which is submitting the job. It fails to sample the input file sizes, but this is just an

Re: Integrate Flink with S3 on EMR cluster

2016-04-07 Thread Timur Fayruzov
There's one more filesystem integration failure that I have found. My job on a toy dataset succeeds, but the Flink log contains the following message: 2016-04-07 18:10:01,339 ERROR org.apache.flink.api.common.io.DelimitedInputFormat - Unexpected problem while getting the file statistics for

Re: Integrate Flink with S3 on EMR cluster

2016-04-06 Thread Ufuk Celebi
Yes, for sure. I added some documentation for AWS here: https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/aws.html Would be nice to update that page with your pull request. :-) – Ufuk On Wed, Apr 6, 2016 at 4:58 AM, Chiwan Park wrote: > Hi Timur,

Re: Integrate Flink with S3 on EMR cluster

2016-04-05 Thread Chiwan Park
Hi Timur, Great! A bootstrap action for Flink is good for AWS users. I think the bootstrap action scripts would be placed in the `flink-contrib` directory. If you want, one of the people in the Flink PMC will assign FLINK-1337 to you. Regards, Chiwan Park > On Apr 6, 2016, at 3:36 AM, Timur Fayruzov

Re: Integrate Flink with S3 on EMR cluster

2016-04-05 Thread Timur Fayruzov
Yes, the Hadoop version was the culprit. It turns out that EMRFS requires at least Hadoop 2.4.0 (judging from the exception in the initial post; I was not able to find the official requirements). Rebuilding Flink with Hadoop 2.7.1 and Scala 2.11 worked like a charm, and I was able to run WordCount using
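
For reference, a source build along those lines looked roughly like this in the Flink 1.0.x era (a sketch from memory; the script and flag names vary across Flink versions, so check the build docs for yours):

    # Switch the build to Scala 2.11, then compile against Hadoop 2.7.1
    tools/change-scala-version.sh 2.11
    mvn clean install -DskipTests -Dhadoop.version=2.7.1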

Re: Integrate Flink with S3 on EMR cluster

2016-04-05 Thread Ufuk Celebi
Hey Timur, if you are using EMR with IAM roles, Flink should work out of the box. You don't need to change the Hadoop config; the IAM role takes care of setting up all credentials at runtime, so you don't have to hardcode any keys in your application. This is the recommended way to go

Re: Integrate Flink with S3 on EMR cluster

2016-04-05 Thread Timur Fayruzov
Hello Ufuk, I'm using EMR 4.4.0. Thanks, Timur. On Tue, Apr 5, 2016 at 2:18 AM, Ufuk Celebi wrote: > Hey Timur, which EMR version are you using? – Ufuk. On Tue, Apr 5, 2016 at 1:43 AM, Timur Fayruzov wrote: > Thanks for the answer,

Re: Integrate Flink with S3 on EMR cluster

2016-04-04 Thread Timur Fayruzov
Thanks for the answer, Ken. My understanding is that file system selection is driven by the following sections in core-site.xml (spelled out below): fs.s3.impl = org.apache.hadoop.fs.s3native.NativeS3FileSystem and fs.s3n.impl = com.amazon.ws.emr.hadoop.fs.EmrFileSystem. If I run the program using
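
Spelled out as XML, those sections would read as follows (reconstructed from the property names and values quoted above):

    <property>
      <name>fs.s3.impl</name>
      <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
    </property>
    <property>
      <name>fs.s3n.impl</name>
      <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
    </property>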

Re: Integrate Flink with S3 on EMR cluster

2016-04-04 Thread Ken Krugler
Normally in Hadoop jobs you’d want to use s3n:// as the protocol, not s3, though EMR has some support for magically treating the s3 protocol as s3n (or maybe s3a now, with Hadoop 2.6 or later). What happens if you use s3n:/// for the --input parameter? — Ken > On Apr 4, 2016, at 2:51pm, Timur
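
Concretely, the suggestion is just to swap the scheme in the job arguments, e.g. (the jar path and bucket are hypothetical):

    # s3n:// (or s3a:// on Hadoop 2.6+) instead of s3://
    ./bin/flink run examples/batch/WordCount.jar \
      --input s3n://my-bucket/input/data.txt \
      --output s3n://my-bucket/output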

Integrate Flink with S3 on EMR cluster

2016-04-04 Thread Timur Fayruzov
Hello, I'm trying to run a Flink WordCount job on an AWS EMR cluster. I succeeded with a three-step procedure: load data from S3 into the cluster's HDFS, run the Flink job, then unload the outputs from HDFS back to S3 (sketched below). However, ideally I'd like to read/write data directly from/to S3. I followed the instructions here:
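
For context, that three-step procedure amounts to something like the following sketch (bucket names are hypothetical; EMR's s3-dist-cp could serve for the copy steps as well):

    # 1) Stage input data from S3 into the cluster's HDFS
    hadoop distcp s3n://my-bucket/input hdfs:///input
    # 2) Run the Flink job against HDFS
    ./bin/flink run examples/batch/WordCount.jar \
      --input hdfs:///input --output hdfs:///output
    # 3) Unload the results back to S3
    hadoop distcp hdfs:///output s3n://my-bucket/output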