Maintain heavy hitters in Flink application

2017-11-29 Thread m@xi
Hello everyone! I want to implement a streaming algorithm like Misa-Gries or Space Saving in Flink. The goal is to maintain the heavy hitters for my (possibly unbounded) input streams throughout all the time my app runs. More precisely, I want to have a non-stop running task that runs the Space

Weird performance on custom Hashjoin w.r.t. parallelism

2017-11-08 Thread m@xi
Hello everyone! I have implemented a custom parallel hashjoin algorithm (without windows feature) in order to calculate the join of two input streams on a common attribute using the CoFlatMap function and the state. After the join operator (which has parallelism p = #processors) operator I have a

Re: Weird performance on custom Hashjoin w.r.t. parallelism

2017-11-08 Thread m@xi
Hello! I found out that the cause of the problem was the map that I have after the parallel join with parallelism 1. When I changed it to .map(new MyMapMeter).setParallelism(p) then when I increase the number of parallelism p the completion time decreases, which is reasonable. Somehow it was a

Re: Use keyBy to deterministically hash each record to a processor/task/slot

2017-11-02 Thread m@xi
Hello Tony, Thanks a lot for your answer. Now I know exactly what happens with keyBy function, yet still I haven't figured out a proper (non hard coded way) to deterministically send a tuple to each key. If somenone from the Flink team could help it would be great! Max -- Sent from:

Re: Use keyBy to deterministically hash each record to a processor/task/slot

2017-11-02 Thread m@xi
Hello Dongwon, Thanks a lot for your excellent reply! Seems we have the same problem. Still your solution is less hard coded than mine. Thanks a lot! I am also looking forward to see a capability of creating a custom partitioner for keyBy() in Flink. Best, Max -- Sent from:

Re: Maintain heavy hitters in Flink application

2017-12-08 Thread m@xi
Kostas and Fabian, Thanks for the advice. I guess I will find a workaround to do the state redistribution. I also read about side outputs in this thread, which might be also an option that I will consider.

Use keyBy to deterministically hash each record to a processor/task/slot

2017-10-30 Thread m@xi
Hi all, After trying to understand exactly how keyBy works internally, I did not get anything more than "it applies obj.hashcode() % n", where n is the number of tasks/processors. This post for example

Re: Setting the parallelism in a cluster of machines properly

2018-04-27 Thread m@xi
Hi Michael! Seems that you were correct. It is weird that I could not set parallelism = 136. I cannot configure the cluster properly so far. I do everything as it is described here [1]. It seems that the JobManager is not reachable. Best, Max [1] --

Re: Setting the parallelism in a cluster of machines properly

2018-04-28 Thread m@xi
The TaskManager cannot reach the JobManager. I get this error. Any ideas? Caused by: org.apache.flink.runtime.client.JobClientActorConnectionTimeoutException: Lost connection to the JobManager. Best, Max -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Storm topology running in flink.

2018-05-11 Thread m@xi
Hello there, I have storm code and I need to run it. If possible I would like to run it with Flink. Is this possible and this feature stable now? Best, Max -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Setting the parallelism in a cluster of machines properly

2018-05-02 Thread m@xi
Hello Fabian! Thanks for the answer. No I did not. Is this a requirement? Best, Max -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Setting the parallelism in a cluster of machines properly

2018-05-02 Thread m@xi
Hey Fabian! Sorry for being unaware regarding Flink configurations, but for me I have followed every step but still setting a simple cluster of 2 nodes proved to be a pain in the as@@#. So, to which value you think I should set the akka timeout? Also, in my head the process is the following :

Re: Setting the parallelism in a cluster of machines properly

2018-04-26 Thread m@xi
No man. I have 17 TaskManagers and each has a number of 8 slots. Do you think it is better to have 8 TaskManager (1 slot each) ? Best, Max -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Setting the parallelism in a cluster of machines properly

2018-04-26 Thread m@xi
Hello Flinkers, I have deployed Flink in a cluster of 17 nodes, each having 8 CPUs. Thus, in total there are 136 CPUs available. I have set the parameter askmanager.numberOfTaskSlots = 8 in all machines, since they have 8 CPUs. And when I am going to run ./flink run -c classpath jarFile -p 136

Re: Setting the parallelism in a cluster of machines properly

2018-04-29 Thread m@xi
Guys seriously I have done the process as described in the documentation of the standalone cluster 20 times. After I start the cluster with ./start-cluster.sh, I normally see with jps the JobManager process running in the master and the TaskManager processes running in slaves. Although every time

Re: Maintain heavy hitters in Flink application

2018-01-31 Thread m@xi
Hello everyone and Happy New Year! Regarding the Heavy Hitter tracking...I wanna do it in a distributed manner. Thus, 1 -- Round Robin the input stream to a number of parallel map instances (say p = env.parallelism) 2 -- Each one of the p mappers maintains approximately the HH of its

Re: Maintain heavy hitters in Flink application

2018-02-01 Thread m@xi
Anyone, someone, somebody? -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

CoProcess() VS union.Process()

2018-02-09 Thread m@xi
Hello Flinkers, I would like to discuss with you about something that bothers me. So, I have two streams that I want to join along with a third stream which I want to consult its data from time to time and triggers decisions. Essentially, this boils down to coProcessing 3 streams together

Re: CoProcess() VS union.Process() & Timers in them

2018-02-13 Thread m@xi
Hello XingCan, Finally, I did it with union. Now inside the processElement() function of my CoProcessFunction I am setting a timer and periodically I want to print out some data through the onTimer() function. Below I attach the image stating the following: "Caused by:

Re: CoProcess() VS union.Process() & Timers in them

2018-02-13 Thread m@xi
OK Great! Thanks a lot for the super ultra fast answer Fabian! One intuitive follow-up question. So, keyed state is the most preferable one, as it is easy for the Flink System to perform the re-distribution in case of change in parallelism, if we have a scale-up or scale-down. Also, it is

Re: CoProcess() VS union.Process() & Timers in them

2018-02-13 Thread m@xi
Thanks a lot Fabian and Xingcan! @ Fabian : Nice answer. It enhanced my intuition on the topic. So, how one may change the parallelism while the Flink job is running, e.g. lower the parallelism during the weekend? Also, it is not clear to me how to use the rescale() operator. If you may provide

Re: How to find correct import packages ?

2018-02-16 Thread m@xi
Hello Esa, First I would recommend to use Intellij as your IDE. There you may enable auto-completion imports that are needed, or you can select manually from all the available options of imports. Furthermore, the Apache Flink Documentation page, [1], is not that thorough in many aspects, but it

RE: How to find correct import packages ?

2018-02-16 Thread m@xi
Well, you must always "google it" before asking in the lists. https://stackoverflow.com/questions/31211842/any-way-or-shortcut-to-auto-import-the-classes-in-intellij-idea-like-in-eclips Also, if you wanna add a dependency in pom.xml (which is essentially the sceleton of your flink program,

Manipulating Processing elements of Network Buffers

2018-02-15 Thread m@xi
Hello Flinker! I know that one should set appropriately the number of Network Buffers (NB) that its Flink deployment will use. Except from that, I am wondering if one might change/manipulate the specific sequence of data records into the NB in order to optimize the performance of its application.

Re: Use keyBy to deterministically hash each record to a processor/task/slot

2018-02-21 Thread m@xi
Hello! I have used up till now your method to generate keys for the .keyBy() function, in order to specifically know at which processor id each tuple will end up in the end (w.r.t the key % #procs operation). Though I had to shift to Java cause the documentation is better. And I implemented your

Re: Share state across operators

2018-02-21 Thread m@xi
Hey Timo! I am using Java for my implementation and I have found this article [1] in stackoverflow for simulating the Either in Java. Now, for my case, I have a coordinator instance (parallelism = 1) that needs to both distribute incoming tuples in a specific way, but also needs to

Re: CoProcess() VS union.Process() & Timers in them

2018-02-20 Thread m@xi
Hey Fabian! Thanks for the comprehensive replies. Now I understand those concepts properly. Regarding .rescale() , it does not receive any arguments. Thus, I assume that the way it does the shuffling from operator A to operator B instances is a black box for the programmer and probably has to do

Re: CoProcess() VS union.Process() & Timers in them

2018-02-20 Thread m@xi
OK man! Thanks a lot. To tell you the truth the documentation did not explain it in a convincing way to consider it an important/potential operator to use in my applications. Thanks for mentioning. Best, Max -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Manipulating Processing elements of Network Buffers

2018-02-20 Thread m@xi
Hi Till! Thanks a lot for your useful reply. So now I get it. I should not manipulate or disturb the network buffer contents, as this will trigger other problematic behaviours. On the other hand, the price of buffering the data in my operator first and e.g. sorting them first based on some

Re: Window with recent messages

2018-02-24 Thread m@xi
Hi Krzystzof, I want to do something which is very similar if not identical to yours. Apply sliding windows in my input streams by using "the swiss army knife" of Flink. If it is easy and not a problem for you, it would be great if you uploaded here the skeleton code of your solution. Thanks in

Re: Share state across operators

2018-03-12 Thread m@xi
Hey Flinker! Anyone? Anybody? Someone with experience or any idea on the question above? Best, Max -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Share state across operators

2018-03-13 Thread m@xi
Thank a lot Timo! Best, Max -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Window with recent messages

2018-04-11 Thread m@xi
Hello there Krzystzof! Thanks a lot for the answer. Sorry for the late reply. I can see the logic behind custom window processing in Flink. Once, an incoming tuple arrives, you add a timer to it, which is going to tick after "RatingExpiration" time units, as shown in your code. This, is made

Simulating Time-based and Count-based Custom Windows with ProcessFunction

2018-04-13 Thread m@xi
Hello Flinkers! Around here and there one may find some post for sliding windows in Flink. I have read that default sliding windows of Flink, the system maintains each window separately in memory, which in my case is prohibitive. Therefore, I want to implement my own sliding windows through

Re: Kafka 0.11

2018-04-22 Thread m@xi
Hi Piotr! In this page of the documentation [1] I can see the different versions of Kafka Connectors, but I am now learning about Kafka so some help would be valuable. 1 -- Are 0.8, 0.9, 0.11 etc different version of the same thing or do they same thing? I mean does 0.11 offers everything the

Re: Kafka 0.11

2018-04-23 Thread m@xi
Hey Michael! Thanks a lot for your answer. 1 -- OK then. Seems that Kafka version 0.11 is the most preferable since it supports exactly-once semantics. 2 -- I have implemented my algorithm in Flink but I would like to implement it on Kafka streams. All of them should run on a Flink cluster

Re: Kafka 0.11

2018-04-23 Thread m@xi
Thanks a lot Ted! I will look into it! If someone else could elaborate on the other bullets it would be great. Best, Max -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Install Flink on Microsoft Azure HDInsight

2018-04-24 Thread m@xi
Hello everyone! My task is to install Flink on an HDInsight cluster in Azure. More specifically, I have installed a Kafka cluster (with preconfigured Kafka) as I would like to also combine Flink and Kafka. Unfortunately, Azure does not provide preconfigured cluster for Flink. So I have to

Flink + HDInsight Cluster Deployment

2018-04-24 Thread m@xi
Hello everyone! My task is to install Flink on an HDInsight cluster in Azure. More specifically, I have installed a Kafka cluster (with preconfigured Kafka) as I would like to also combine Flink and Kafka. Unfortunately, Azure does not provide preconfigured cluster for Flink. So I have to

Testing DataSet API's join

2019-05-06 Thread m@xi
Hello Flinkers, I am experimenting a bit with DataSet API and I have written a simple program that joins two (key, value) datasets by key. The server I am running my experiments has 12 cores with 4 threads each, thus I have set the number of slots for a TaskManager to 12x4=48 to leverage the full

Re: Fwd: Complex graph-based sessionization (potential use for stateful functions)

2020-04-07 Thread m@xi
Hello Robert Thanks to your reply I discovered the Stateful Functions which I believe is a quite powerful tool. I have some questions: 1) As you said, "the community is currently in the process of releasing the first Apache release of StateFun and it should hopefully be out by the end of this

Re: Complex graph-based sessionization (potential use for stateful functions)

2020-04-26 Thread m@xi
Hi Igal, Thanks a lot for your answer. I believe you are one of the core developers behind the interesting statefun. Your suggestion is really nice and as you say, one way is to tailor the graph processing to the philosophy of SF. Though, if one vertex is a stateful function, then heavy hitter

Re: Fwd: Complex graph-based sessionization (potential use for stateful functions)

2020-04-26 Thread m@xi
Hi Robert, Thanks a lot for your reply. 1) Now statefun packages are in the MVN repository, so probably they needed some time to really be included there after your official release. 2) Alinged to the topic of the thread, I am referring to state of massive graph streams. To enable

Re: Using Stateful Functions within a Flink pipeline

2020-04-26 Thread m@xi
Dear Igal, Can you elaborate more on your proposed solution of splitting the pipeline? If possible, providing some skeleton pseudocode would be awesome! Thanks in advance. Best, Max -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: ML/DL via Flink

2020-05-02 Thread m@xi
Hello Timo and Bechet, @Timo: Thanks a lot for the message forwarding. @Bechet: Thanks for your answer. I was not aware of *flink-ai-extended* project. Also I was not aware of the fact that ALink is striving to become the new FlinkML. Definitely, I will look into ALink and flink-ai-extended.

Re: ML/DL via Flink

2020-05-05 Thread m@xi
Hello Becket, I just watched your Flink Forward talk. Really interesting! I leave the link here as it is related to the post. AI Flow (FF20) Best, Max -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

ML/DL via Flink

2020-04-27 Thread m@xi
Hello Flinkers, I am building a *streaming* prototype system on top of Flink and I want ideally to enable ML training (if possible DL) in Flink. It would be nice to lay down all the existing libraries that provide primitives that enable the training of ML models. I assume it is more efficient

Re: Window processing in Stateful Functions

2020-05-06 Thread m@xi
Hello Tez, With all the respect, I doubt your answer is related the question. *Just to re-phase a bit*: Assuming we use SF for our application, how can we apply window logic when a function does some processing? *Is there a proper way?* @*Igal*: we would very much appreciate your answer. :)

Re: Window processing in Stateful Functions

2020-05-08 Thread m@xi
Dear Igal,Very insightful answer. Thanks. Igal Shilman wrote > An alternative approach would be to implement a * > thumbling window * > per vertex(a stateful function instance)by sending to itself a delayed > message [2]. When that specific delayedmessage arrives you wouldhave to > purge the

Re: Using Queryable State within 1 job + docs suggestion

2020-05-22 Thread m@xi
Hi Gordon, Yes we are well aware of the inconsistencies that can (and will) emerge while using queryable state like this. However, we will treat them manually for ensuring the correctness of our targeting applications. Therefore, could you help with Annemarie's question or are you aware of