Re: What is the purpose or impact of the -yst (--yarnstreaming) argument

2016-05-25 Thread Aljoscha Krettek
Hi Prateek, this is a deprecated setting that affects how memory is allocated on Flink worker nodes. Since at least 1.0.0, the default behavior is the behavior that the --yst flag would previously have requested. In short, you don't need the flag when running streaming programs. (Except Robert

Re: Incremental updates

2016-05-25 Thread Aljoscha Krettek
Hi, first question: are you manually keying by "userId % numberOfPartitions"? Flink internally does roughly "key.hash() % numPartitions", so it is enough to specify the userId as your key. Now, for your questions: 1. What Flink guarantees is that the state for a key k is always available when an
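For reference, a minimal sketch in Java of keying directly on the userId as described above (the Event type and its field are illustrative, not from the thread):

    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.KeyedStream;

    public class KeyByUserId {
        // Hypothetical event type carrying a user id.
        public static class Event {
            public long userId;
        }

        public static KeyedStream<Event, Long> keyByUser(DataStream<Event> events) {
            // No manual "userId % numberOfPartitions" -- Flink hashes the key
            // and assigns it to a parallel instance internally.
            return events.keyBy(new KeySelector<Event, Long>() {
                @Override
                public Long getKey(Event e) {
                    return e.userId;
                }
            });
        }
    }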

How to perform multiple stream join functionality

2016-05-25 Thread prateekarora
Hi, I am trying to port my Spark application to Flink. In Spark I have used the command below to join multiple streams: val stream = stream1.join(stream2).join(stream3).join(stream4) As per my understanding, Flink requires a window operation because it doesn't work on RDDs like Spark does. So I
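For context, a hedged sketch of one pairwise step of such a join in Flink's streaming API (types, key fields, and window size are illustrative; joining four streams means chaining three of these). Each join needs a key on both sides and a window:

    import org.apache.flink.api.common.functions.JoinFunction;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class PairwiseJoin {
        public static DataStream<Tuple2<String, Integer>> join(
                DataStream<Tuple2<String, Integer>> stream1,
                DataStream<Tuple2<String, Integer>> stream2) {
            KeySelector<Tuple2<String, Integer>, String> byKey =
                new KeySelector<Tuple2<String, Integer>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Integer> t) {
                        return t.f0;
                    }
                };
            return stream1.join(stream2)
                .where(byKey)
                .equalTo(byKey)
                // Elements only join if they fall into the same window; pick
                // the assigner that matches your time semantics.
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .apply(new JoinFunction<Tuple2<String, Integer>,
                        Tuple2<String, Integer>, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> join(
                            Tuple2<String, Integer> a, Tuple2<String, Integer> b) {
                        return new Tuple2<>(a.f0, a.f1 + b.f1);
                    }
                });
        }
    }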

What is the purpose or impact of the -yst (--yarnstreaming) argument

2016-05-25 Thread prateekarora
Hi, I am running a Flink Kafka streaming application and have not seen any impact of the -yst (--yarnstreaming) argument on my application. I thought this argument was introduced in 1.0.2. Can anyone explain what the purpose of this argument is? Regards, Prateek

Re: Incremental updates

2016-05-25 Thread Malgorzata Kudelska
Hi, I have the following situation. - a keyed stream with a key defined as: userId % numberOfPartitions - a custom flatMap transformation where I use a ValueState variable to keep the state of some calculations for each userId - my questions are: 1. Does Flink guarantee that the users with a given
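A minimal sketch of that setup, assuming the thread means Flink's ValueState (the descriptor name, types, and counting logic are illustrative):

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    public class CountingFlatMap extends RichFlatMapFunction<Long, String> {
        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            // State is scoped to the current key automatically: Flink keeps
            // the state for key k wherever records with key k are processed.
            count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("perUserCount", Long.class, 0L));
        }

        @Override
        public void flatMap(Long userId, Collector<String> out) throws Exception {
            long newCount = count.value() + 1;
            count.update(newCount);
            out.collect("user " + userId + " seen " + newCount + " times");
        }
    }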

Re: stream keyBy without repartition

2016-05-25 Thread Bart Wyatt
Aljoscha, Thanks for the pointers. I was able to get a pretty simple utility class up and running that gives me basic keyed fold/reduce/windowedFold/windowedReduce operations that don't change the partitioning. This will be invaluable until an official feature is supported. Cheers,

Re: enableObjectReuse and multiple downstream operators

2016-05-25 Thread Aljoscha Krettek
Hi Bart, yup, this is a bug. AFAIK it is not yet known; would you like to open the Jira issue for it? If not, I can also open one. The problem is in the interaction of how chaining works in the streaming API with object reuse. As you said, with how it is implemented it serially calls the two map
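Until a fix lands, a hedged workaround sketch (not spelled out in the thread): either leave object reuse off for jobs where one stream feeds two operators, or break the chain so each branch may receive its own deserialized copy rather than the same reused instance. Mapper and sink bodies below are placeholders:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.sink.SinkFunction;

    public class ChainWorkaround {
        public static void wire(DataStream<String> input) {
            // disableChaining() inserts a serialization boundary between the
            // source and each map, which may avoid sharing one reused object.
            input.map(new MapFunction<String, String>() {
                @Override
                public String map(String value) { return value.toUpperCase(); }
            }).disableChaining().addSink(new PrintSink());

            input.map(new MapFunction<String, String>() {
                @Override
                public String map(String value) { return value.toLowerCase(); }
            }).disableChaining().addSink(new PrintSink());
        }

        // Trivial placeholder sink.
        private static class PrintSink implements SinkFunction<String> {
            @Override
            public void invoke(String value) {
                System.out.println(value);
            }
        }
    }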

Re: Non blocking operation in Apache flink

2016-05-25 Thread Maatary Okouya
Thank you, I will study that. It is a bit more raw, I would say. The thing is, my source is Kafka. I will have to see how to combine all of that together in the most elegant way possible. Will get back to you on this, after I scratch my head enough. Best, Daniel On Wed, May 25, 2016 at 11:02

enableObjectReuse and multiple downstream operators

2016-05-25 Thread Bart Wyatt
(For reference, I'm on 1.0.3.) I have a job that looks like this: DataStream input = ...; input.map(MapFunction...).addSink(...); input.map(MapFunction...).addSink(...); If I do not call enableObjectReuse() it works; if I do call enableObjectReuse() it

Re: Non blocking operation in Apache flink

2016-05-25 Thread Aljoscha Krettek
I see what you mean now. The Akka Streams API is very interesting in how it allows async calls. For Flink, I think you could implement it as a custom source that listens for the change stream, starts futures to get data from the database, and emits elements when the future completes. I quickly
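A rough sketch of that pattern (class name and the lookup helpers are hypothetical placeholders): a source that polls the change stream, fires asynchronous database lookups, and emits results under the checkpoint lock as each future completes:

    import java.util.concurrent.CompletableFuture;
    import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

    public class ChangeStreamSource extends RichSourceFunction<String> {
        private volatile boolean running = true;

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            while (running) {
                final String changeEvent = pollChangeStream(); // hypothetical blocking poll
                CompletableFuture
                    .supplyAsync(() -> fetchFromDatabase(changeEvent)) // hypothetical lookup
                    .thenAccept(record -> {
                        // Emit under the checkpoint lock so checkpoints see a
                        // consistent view of what has been emitted.
                        synchronized (ctx.getCheckpointLock()) {
                            ctx.collect(record);
                        }
                    });
            }
        }

        @Override
        public void cancel() {
            running = false;
        }

        // Placeholder implementations, just so the sketch compiles.
        private String pollChangeStream() { return "change"; }
        private String fetchFromDatabase(String change) { return "record-for-" + change; }
    }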

Re: Logging Exceptions

2016-05-25 Thread David Kim
Awesome, thank you! David On Wed, May 25, 2016 at 4:54 AM Aljoscha Krettek wrote: > Hi David, > you are right, for some exceptions Flink only forwards to the > web-dashboard/application client but does not print to the log file. I > opened a Jira issue to track this:

Re: Logging with slf4j

2016-05-25 Thread Stefano Baghino
It looks like the logs should show up. Are you using log4j or logback? Did you include the slf4j binding for the library you're using? Did any slf4j warning appear on stdout when the job was run? If you set everything up correctly, can you share your logback/log4j config file with us so that we can

Logging with slf4j

2016-05-25 Thread simon peyer
Hi guys, trying to log stuff I used the print/println functions, which work quite well. But now I would like to use the slf4j logger. For each class/object in Scala I initialized a logger like this: var log: Logger = LoggerFactory.getLogger(getClass) Then I did some logging
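For reference, a minimal sketch of slf4j inside a Flink function (Java here, though the poster uses Scala). One frequent culprit, hedged as a guess at the problem: println output typically ends up in the TaskManager's .out files, while slf4j output goes to its .log files, so the log lines may simply be in a different place:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class LoggingMapper implements MapFunction<String, String> {
        // static: loggers are not serializable, so don't make them non-transient
        // instance fields of a function that gets shipped to the cluster.
        private static final Logger LOG = LoggerFactory.getLogger(LoggingMapper.class);

        @Override
        public String map(String value) {
            LOG.info("processing element: {}", value); // parameterized, no concatenation
            return value;
        }
    }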

Re: stream keyBy without repartition

2016-05-25 Thread Bart Wyatt
I will give this a shot this morning. Considering this and the other email "Does Kafka connector leverage Kafka message keys?", which also ends up talking about hacking around KeyedStream's use of a HashPartitioner<>(...), is it worth looking into providing a KeyedStream constructor that uses

Re: Data ingestion using a Flink TCP Server

2016-05-25 Thread omaralvarez
Thanks to everybody, all my doubts are solved. I gotta give it to you guys, the answers were really fast! Cheers, Omar.

Re: How to perform this join operation?

2016-05-25 Thread Stephan Ewen
Hi Elias! I think you brought up a couple of good issues. Let me try and summarize what we have so far: 1) Joining in a more flexible fashion => The problem you are solving with the trailing / sliding window combination: Is the right way to phrase the join problem "join records where key is

Re: Merging sets with common elements

2016-05-25 Thread Simone Robutti
@Till: A more meaningful example would be the following: from {(1,{1,2}), (2,{2,3}), (3,{4,5}), (4,{1,27})} the result should be {1,2,3,27},{4,5}, because sets #1, #2 and #4 have at least one element in common. If you see this as a graph where the elements of the sets are nodes and a set expresses a
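Viewed that way, the merge is exactly connected components over the elements. A plain-Java sketch with a small union-find, just to pin down the semantics (in Flink this would map to an iterative dataflow or Gelly's connected components):

    import java.util.Collection;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class SetMerger {
        private final Map<Integer, Integer> parent = new HashMap<>();

        private int find(int x) {
            parent.putIfAbsent(x, x);
            int root = parent.get(x);
            if (root != x) {
                root = find(root);
                parent.put(x, root); // path compression
            }
            return root;
        }

        private void union(int a, int b) {
            parent.put(find(a), find(b));
        }

        // Merges all sets that (transitively) share at least one element, e.g.
        // [{1,2},{2,3},{4,5},{1,27}] -> [{1,2,3,27},{4,5}].
        public Collection<Set<Integer>> merge(List<Set<Integer>> sets) {
            for (Set<Integer> s : sets) {
                Iterator<Integer> it = s.iterator();
                if (!it.hasNext()) continue;
                int first = it.next();
                while (it.hasNext()) union(first, it.next());
            }
            Map<Integer, Set<Integer>> groups = new HashMap<>();
            for (Set<Integer> s : sets) {
                for (int e : s) {
                    groups.computeIfAbsent(find(e), k -> new HashSet<>()).add(e);
                }
            }
            return groups.values();
        }
    }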

Re: Data ingestion using a Flink TCP Server

2016-05-25 Thread Aljoscha Krettek
Hi, yes, if you have 4 independent non-parallel sources they will be executed independently in different threads. Cheers, Aljoscha On Wed, 25 May 2016 at 13:40 omaralvarez wrote: > Thanks for your answers, this makes the implementation way easier, I don't > have to

Re: Incremental updates

2016-05-25 Thread Malgorzata Kudelska
Hi, Thanks for your reply. Is Flink able to detect that an additional server joined and rebalance the processing? How is it done if I have a keyed stream and some custom ValueState variables? Cheers, Gosia 2016-05-25 11:32 GMT+02:00 Aljoscha Krettek : > Hi Gosia, > right

Re: Combining streams with static data and using REST API as a sink

2016-05-25 Thread Josh
Hi Aljoscha, That sounds exactly like the kind of feature I was looking for, since my use-case fits the "Join stream with slowly evolving data" example. For now, I will do an implementation similar to Max's suggestion. Of course it's not as nice as the proposed feature, as there will be a delay

Re: Data ingestion using a Flink TCP Server

2016-05-25 Thread Stephan Ewen
Hi! A typical example of a parallel source is the Kafka source. Actually, threads other than the main run() thread can call ctx.collect(), provided they use the checkpoint lock properly. The Kafka source does that. Stephan On Wed, May 25, 2016 at 11:50 AM, omaralvarez

Re: Non blocking operation in Apache flink

2016-05-25 Thread Maatary Okouya
Maybe the following can illustrate better what i mean http://doc.akka.io/docs/akka/2.4.6/scala/stream/stream-integrations.html#Integrating_with_External_Services On Wed, May 25, 2016 at 5:16 AM, Aljoscha Krettek wrote: > Hi, > there is no functionality to have asynchronous

Re: Logging Exceptions

2016-05-25 Thread Aljoscha Krettek
Hi David, you are right, for some exceptions Flink only forwards to the web-dashboard/application client but does not print to the log file. I opened a Jira issue to track this: FLINK-3969. Thanks for reporting! Aljoscha On Mon, 23 May 2016 at

Re: Non blocking operation in Apache flink

2016-05-25 Thread Maatary Okouya
Thank you for your answer. Maybe I should have mentioned that I am at the beginning with both frameworks, somewhat making a choice by evaluating their capabilities. I know Akka Streams better. So my question would be simple. Let's say that 1) I have a stream of events that are simply information about

Re: Merging sets with common elements

2016-05-25 Thread Aljoscha Krettek
Hi, if I understood it correctly the "key" in that case would be a fuzzy/probabilistic key. I'm not sure this can be computed using either the sort-based or hash-based joining/grouping strategies of Flink. Maybe we can find something if you elaborate. Cheers, Aljoscha On Wed, 25 May 2016 at

Re: Merging sets with common elements

2016-05-25 Thread Till Rohrmann
Hi Simone, could you elaborate a little bit on the actual operation you want to perform? Given a data set {(1, {1,2}), (2, {2,3})}, what's the result of your operation? Is the result { ({1,2}, {1,2,3}) } because the 2 is contained in both sets? Cheers, Till On Wed, May 25, 2016 at 10:22 AM,

Re: Dynamic partitioning for stream output

2016-05-25 Thread Kostas Kloudas
Hi Juho, To be more aligned with the semantics in Flink, I would suggest a solution with a single modified RollingSink that caches multiple buckets (from the Bucketer) and flushes (some of) them to disk whenever certain time or space criteria are met. I would say that it is worth modifying

Re: subtasks and objects

2016-05-25 Thread Aljoscha Krettek
Hi, I think this is correct, yes. It is probably not a good idea to use static code in Flink jobs. Cheers, Aljoscha On Wed, 25 May 2016 at 00:27 Stavros Kontopoulos wrote: > Hey, > > Is it true that since taskmanager (a jvm) may have multiple subtasks > implementing

Re: Data ingestion using a Flink TCP Server

2016-05-25 Thread Aljoscha Krettek
Hi, regarding your first question. I think it is in general not safe to call ctx.collect() from threads other than the thread that invokes the run() method of your SourceFunction. What I would suggest is to have a queue that your reader threads put data into and then read from that queue in
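A hedged sketch of that queue hand-off (class name, element type, and queue capacity are illustrative): reader threads enqueue, and only the run() thread calls ctx.collect():

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;
    import org.apache.flink.streaming.api.functions.source.SourceFunction;

    public class QueueingSource implements SourceFunction<String> {
        private volatile boolean running = true;
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(10000);

        // Called by the reader threads (e.g. TCP connection handlers, not shown).
        public void enqueue(String line) throws InterruptedException {
            queue.put(line);
        }

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            while (running) {
                // Poll with a timeout so cancel() can terminate the loop.
                String next = queue.poll(100, TimeUnit.MILLISECONDS);
                if (next != null) {
                    ctx.collect(next); // only the run() thread emits
                }
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }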

Re: stream keyBy without repartition

2016-05-25 Thread Aljoscha Krettek
Hi, what Kostas said is correct. You can however, hack it. You would have to manually instantiate a WindowOperator and apply it on the non-keyed DataStream while still providing a key-selector (and serializer) for state. This might sound complicated but I'll try and walk you through the steps.

Re: Flink config as an argument (akka.client.timeout)

2016-05-25 Thread Ufuk Celebi
On Wed, May 25, 2016 at 8:49 AM, Juho Autio wrote: > Is there any way to set akka.client.timeout (or other flink config) when > calling bin/flink run instead of editing flink-conf.yaml? I tried to add it > as a -yD flag but couldn't get it working. I think this currently

Re: writeAsCSV with partitionBy

2016-05-25 Thread Aljoscha Krettek
Hi, the RollingSink can only be used with streaming. Adding support for dynamic paths based on element contents is certainly interesting. I imagine it can be tricky, though, to figure out when to close/flush the buckets. Cheers, Aljoscha On Wed, 25 May 2016 at 08:36 KirstiLaurila

Merging sets with common elements

2016-05-25 Thread Simone Robutti
Hello, I'm implementing MinHash for recommendation on Flink. I'm almost done, but I need an efficient way to merge sets of similar keys together (and later join these sets of keys with more data). The actual data structure is of the form DataSet[(Int,Set[Int])], where the left element of the tuple

Flink config as an argument (akka.client.timeout)

2016-05-25 Thread Juho Autio
Is there any way to set akka.client.timeout (or other flink config) when calling bin/flink run instead of editing flink-conf.yaml? I tried to add it as a -yD flag but couldn't get it working. Related: https://issues.apache.org/jira/browse/FLINK-3964

Re: Dynamic partitioning for stream output

2016-05-25 Thread Juho Autio
Related issue: https://issues.apache.org/jira/browse/FLINK-2672 On Wed, May 25, 2016 at 9:21 AM, Juho Autio wrote: > Thanks, indeed the desired behavior is to flush if bucket size exceeds a > limit but also if the bucket has been open long enough. Contrary to the > current

Re: writeAsCSV with partitionBy

2016-05-25 Thread KirstiLaurila
Maybe, I don't know, but that's with streaming. How about batch? Srikanth wrote: > Isn't this related to https://issues.apache.org/jira/browse/FLINK-2672 ? > > This can be achieved with a RollingSink [1] & custom Bucketer, probably. > > [1] >

Re: writeAsCSV with partitionBy

2016-05-25 Thread Juho Autio
RollingSink is part of the Flink Streaming API. Can it be used in Flink batch jobs, too? As implied in FLINK-2672, RollingSink doesn't support dynamic bucket paths based on the tuple fields. The path must be given when creating the RollingSink instance, i.e. before deploying the job. Yes, a custom
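For the streaming case, a short sketch of the RollingSink + Bucketer combination under discussion (base path and date format are illustrative; as noted, the bucket path comes from the Bucketer and the current time, not from tuple fields):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.connectors.fs.DateTimeBucketer;
    import org.apache.flink.streaming.connectors.fs.RollingSink;

    public class RollingSinkExample {
        public static void attach(DataStream<String> stream) {
            RollingSink<String> sink = new RollingSink<String>("/base/path")
                // DateTimeBucketer derives the bucket directory from the
                // current time, e.g. /base/path/2016-05-25--14/...
                .setBucketer(new DateTimeBucketer("yyyy-MM-dd--HH"));
            stream.addSink(sink);
        }
    }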

Re: Dynamic partitioning for stream output

2016-05-25 Thread Juho Autio
Thanks, indeed the desired behavior is to flush if the bucket size exceeds a limit, but also if the bucket has been open long enough. Contrary to the current RollingSink, we don't want to flush every time the bucket changes but rather keep multiple buckets "open" as needed. In our case the date to use