State in external db (dynamodb)

2016-04-05 Thread Shannon Carey
Hi, new Flink user here! I found a discussion on user@flink.apache.org about using DynamoDB as a sink. However, as noted, sinks have an at-least-once guarantee so your operations must idempotent. However, another way to go about this (and correct me if I'm wrong) is to write the state to the

RemoteTransportException when trying to redis in flink code

2016-04-05 Thread Balaji Rajagopalan
I am trying to use AWS EMR yarn cluster where the flink code runs, in one of apply window function, I try to set some values in redis it fails. I have tried to access the same redis with no flink code and get/set works, but from the flink I get into this exception. Any inputs on what might be

Re: Integrate Flink with S3 on EMR cluster

2016-04-05 Thread Chiwan Park
Hi Timur, Great! Bootstrap action for Flink is good for AWS users. I think the bootstrap action scripts would be placed in `flink-contrib` directory. If you want, one of people in PMC of Flink will be assign FLINK-1337 to you. Regards, Chiwan Park > On Apr 6, 2016, at 3:36 AM, Timur Fayruzov

Re: Convert Scala DataStream to Java DataStream

2016-04-05 Thread Stefano Baghino
Hi Saiph, all you have to do is to invoke the `javaStream` method on your Scala DataStream. Hope I've been helpful. :) On Tue, Apr 5, 2016 at 7:35 PM, Saiph Kappa wrote: > Hi, > > I'm programming in scala and using some extra libraries made in Java. My > question is:

Accessing RDF triples using Flink

2016-04-05 Thread Ritesh Kumar Singh
Hi, I need some suggestions regarding accessing RDF triples from flink. I'm trying to integrate flink in a pipeline where the input for flink comes from SPARQL query on a Jena model. And after modification of triples using flink, I will be performing SPARQL update using Jena to save my changes.

Back Pressure details

2016-04-05 Thread Zach Cox
Hi - I'm trying to identify bottlenecks in my Flink streaming job, and am curious about the Back Pressure view in the job manager web UI. If there are already docs for Back Pressure please feel free to just point me to those. :) When "Sampling in progress..." is displayed, what exactly is

Kafka state backend?

2016-04-05 Thread Zach Cox
Hi - as clarified in another thread [1] stateful operators store all of their current state in the backend on each checkpoint. Just curious if Kafka topics with log compaction have ever been considered as a possible state backend? Samza [2] uses RocksDB as a local state store, with all writes

Re: Checkpoint state stored in backend, and deleting old checkpoint state

2016-04-05 Thread Konstantin Knauf
Hi Ufuk, I thought so, but I am not sure when and where ;) I will let you know, if I come across it again. Cheers, Konstantin On 05.04.2016 21:10, Ufuk Celebi wrote: > Hey Zach and Konstantin, > > Great questions and answers. We can try to make this more explicit in the > docs. > > On Tue,

Re: Checkpoint state stored in backend, and deleting old checkpoint state

2016-04-05 Thread Ufuk Celebi
Hey Zach and Konstantin, Great questions and answers. We can try to make this more explicit in the docs. On Tue, Apr 5, 2016 at 8:54 PM, Konstantin Knauf wrote: > To my knowledge flink takes care of deleting old checkpoints (I think it > says so in the

RE: Multiple operations on a WindowedStream

2016-04-05 Thread Kanak Biscuitwala
This worked when I ran my test code locally, but I'm seeing nothing reach my sink when I try to run this in YARN (previously, when I just echo'ed all sums to my sink, it would work). Here's what my code looks like:         StreamExecutionEnvironment env =

Re: Checkpoint state stored in backend, and deleting old checkpoint state

2016-04-05 Thread Konstantin Knauf
Hi Zach, some answers/comments inline. Cheers Konstantin On 05.04.2016 20:39, Zach Cox wrote: > Hi - I have some questions regarding Flink's checkpointing, specifically > related to storing state in the backends. > > So let's say an operator in a streaming job is building up some state. >

Checkpoint state stored in backend, and deleting old checkpoint state

2016-04-05 Thread Zach Cox
Hi - I have some questions regarding Flink's checkpointing, specifically related to storing state in the backends. So let's say an operator in a streaming job is building up some state. When it receives barriers from all of its input streams, does it store *all* of its state to the backend? I

Re: Integrate Flink with S3 on EMR cluster

2016-04-05 Thread Timur Fayruzov
Yes, Hadoop version was the culprit. It turns out that EMRFS requires at least 2.4.0 (judging from the exception in the initial post, I was not able to find the official requirements). Rebuilding Flink with Hadoop 2.7.1 and with Scala 2.11 worked like a charm and I was able to run WordCount using

Re: Integrate Flink with S3 on EMR cluster

2016-04-05 Thread Ufuk Celebi
Hey Timur, if you are using EMR with IAM roles, Flink should work out of the box. You don't need to change the Hadoop config and the IAM role takes care of setting up all credentials at runtime. You don't need to hardcode any keys in your application that way and this is the recommended way to go

Convert Scala DataStream to Java DataStream

2016-04-05 Thread Saiph Kappa
Hi, I'm programming in scala and using some extra libraries made in Java. My question is: how can I easily convert "org.apache.flink.streaming.scala.DataStream" to "org.apache.flink.streaming.api.datastream.DataStream"? Thanks.

Re: Powered by Flink

2016-04-05 Thread Slim Baltagi
Hi The following are missing in the ‘Powered by Flink’ list: king.com https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces88 Otto Group http://data-artisans.com/how-we-selected-apache-flink-at-otto-group/

Re: Integrate Flink with S3 on EMR cluster

2016-04-05 Thread Timur Fayruzov
Hello Ufuk, I'm using EMR 4.4.0. Thanks, Timur On Tue, Apr 5, 2016 at 2:18 AM, Ufuk Celebi wrote: > Hey Timur, > > which EMR version are you using? > > – Ufuk > > On Tue, Apr 5, 2016 at 1:43 AM, Timur Fayruzov > wrote: > > Thanks for the answer,

Re: Powered by Flink

2016-04-05 Thread Robert Metzger
Hi everyone, I would like to bring the "Powered by Flink" wiki page [1] to the attention of Flink user's who recently joined the Flink community. The list tracks which organizations are using Flink. If your company / university / research institute / ... is using Flink but the name is not yet

Re: Flink Job History Dump

2016-04-05 Thread Ufuk Celebi
Hey Robert! This is currently not possible :-(, but this is a feature that is on Flink's road map. A very inconvenient workaround could be to manually query the REST APIs [1] and dump the responses somewhere and query it there. – Ufuk [1]

Flink Job History Dump

2016-04-05 Thread Robert Schmidtke
Hi everyone, I'm using Flink 0.10.2 to run some benchmarks on my cluster and I would like to compare it to Spark 1.6.0. Spark has an eventLog property that I can use to have the history written to HDFS, and then later view it offline on the History Server. Does Flink have a similar Feature,

Re: Handling large state (incremental snapshot?)

2016-04-05 Thread Hironori Ogibayashi
Aljoscha, Thank you for your quick response. Yes, I am using FsStateBackend, so I will try RocksDB backend. Regards, Hironori 2016-04-05 21:23 GMT+09:00 Aljoscha Krettek : > Hi, > I guess you are using the FsStateBackend, is that correct? You could try > using the RocksDB

Handling large state (incremental snapshot?)

2016-04-05 Thread Hironori Ogibayashi
Hello, I am trying to implement windowed distinct count on a stream. In this case, the state have to hold all distinct value in the window, so can be large. In my test, if the state size become about 400MB, checkpointing takes 40sec and spends most of Taskmanager's CPU. Are there any good way to

Re: CEP API: Question on FollowedBy

2016-04-05 Thread Till Rohrmann
Yes exactly. This is a feature which we still have to add. On Tue, Apr 5, 2016 at 1:07 PM, Anwar Rizal wrote: > Thanks Till. > > The only way I can change the behavior would be to post filter the result > then. > > Anwar. > > On Tue, Apr 5, 2016 at 11:41 AM, Till Rohrmann

Re: CEP API: Question on FollowedBy

2016-04-05 Thread Anwar Rizal
Thanks Till. The only way I can change the behavior would be to post filter the result then. Anwar. On Tue, Apr 5, 2016 at 11:41 AM, Till Rohrmann wrote: > Hi Anwar, > > yes, once we have published the introductory blog post about the CEP > library, we will also publish

Re: YARN High Availability

2016-04-05 Thread Konstantin Knauf
Hi Robert, I tried several paths and rmr before. It stopped after 1-2 minutes. There was an exception on the shell. Sorry, should have attached to the last mail. Thanks, Konstnatin On 05.04.2016 11:22, Robert Metzger wrote: > I've tried reproducing the issue on a test cluster, but everything

Re: [DISCUSS] Allowed Lateness in Flink

2016-04-05 Thread Aljoscha Krettek
By the way. The way I see to fixing this is extending WindowAssigner with an "isEventTime()" method and then allow accumulating/lateness in the WindowOperator only if this is true. But it seems a but hacky because it special cases event-time. But then again, maybe we need to special case it ...

[DISCUSS] Allowed Lateness in Flink

2016-04-05 Thread Aljoscha Krettek
Hi Folks, as part of my effort to improve the windowing in Flink [1] I also thought about lateness, accumulating/discarding and window cleanup. I have some ideas on this but I would love to get feedback from the community as I think that these things are important for everyone doing event-time

How to test serializability of a Flink job

2016-04-05 Thread Simone Robutti
Hello, last week I got a problem where my job worked in local mode but could not be serialized on the cluster. I assume that local mode does not really serialize all the operators (the problem was with a custom map function) and I need to enforce this behaviour in local mode or, better, be able

Re: CEP API: Question on FollowedBy

2016-04-05 Thread Till Rohrmann
Hi Anwar, yes, once we have published the introductory blog post about the CEP library, we will also publish a more in-depth description of the approach we have implemented. To spoil it a little bit: We have mainly followed the paper “Efficient Pattern Matching over Event Streams” for the

Re: YARN High Availability

2016-04-05 Thread Robert Metzger
I've tried reproducing the issue on a test cluster, but everything worked fine. Have you tried different values for "recovery.zookeeper.path.root" or only one? Maybe the path you've put contains invalid data? Regarding the client log you've send: Did you manually stop the client or did it stop

Re: FYI: Updated Slides Section

2016-04-05 Thread Theodore Vasiloudis
Hello all, you can find my slides on Large-Scale Machine Learning with FlinkML here (from SICS Data Science day and FOSDEM 2016): http://www.slideshare.net/TheodorosVasiloudis/flinkml-large-scale-machine-learning-with-apache-flink Best, Theodore On Mon, Apr 4, 2016 at 3:19 PM, Rubén Casado

Re: building for Scala 2.11

2016-04-05 Thread Andrew Gaydenko
Balaji, now I see it is my mistake: I wasn't clear enough in my question, sorry. Saying "the project" I mean Flink project itself. The question is already answered. Regards, Andrew Balaji Rajagopalan writes: > In your pom file you can mention against which

Re: building for Scala 2.11

2016-04-05 Thread Andrew Gaydenko
Chiwan, thanks, got it! - and the build finished with success. I still a little confused with the method used: a tool from tools/ changes files being under the Git control. Regards, Andrew Chiwan Park writes: > Hi Andrew, > > The method to build Flink with Scala 2.11

Re: YARN High Availability

2016-04-05 Thread Ufuk Celebi
Hey Konstantin, just looked at the logs and the cluster is started, but the job is indeed never submitted. I've forwarded this to Robert, because he is familiar with the YARN client. I will look into how the client interacts with the ZooKeeper root path. – Ufuk On Tue, Apr 5, 2016 at 9:18 AM,