Re: How to perform efficient DataSet reuse between iterations

2017-12-07 Thread Fabian Hueske
Hi Miguel, I just had another idea that you could try. The problem seems to be that the plan space that the optimizer enumerates exponentially grows as you add more iterations. This happens when there are multiple valid execution strategies for a given operator (mostly applied to Joins). You coul

Re: How to perform efficient DataSet reuse between iterations

2017-12-07 Thread Miguel Coimbra
Hello Fabian, It really looks like an issue requiring attention. Since I am launching the client application via Maven, I opted to change the maximum memory setting with export MAVEN_OPTS="-Xms256m -Xmx4m". To give an example, for three (3) iterations it worked fine with around 4 GB of memory

Re: How to perform efficient DataSet reuse between iterations

2017-12-06 Thread Fabian Hueske
Hmm, this does not look too good. As I expected, the program gets stuck in the optimizer. Plan optimization can be quite expensive for large plans. There might be a way to improve the optimization of large plans by cutting the plan space but I would not expect this to be fixed in the near future.

Re: How to perform efficient DataSet reuse between iterations

2017-12-06 Thread Miguel Coimbra
Hello Fabian, After increasing the message size akka parameter, the client resulted in the following exception after some time. This confirms that the JobManager never received the job request: [WARNING] java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.i

Re: How to perform efficient DataSet reuse between iterations

2017-12-06 Thread Fabian Hueske
Hi Miguel, if the message size would be the problem, the client should fail with an exception. What might happen, is that the client gets stuck while optimizing the program. You could take a stacktrace of the client process to identify at which part the client gets stuck. Best, Fabian 2017-12-0

Re: How to perform efficient DataSet reuse between iterations

2017-12-05 Thread Miguel Coimbra
Hello Fabian, Thanks for the help. I am interested in the duration of specific operators, so the fact that parts of the execution are in pipeline is not a problem for me. >From my understanding, the automated way to approach this is to run the Flink job with the web interface active and then make

Re: How to perform efficient DataSet reuse between iterations

2017-11-29 Thread Fabian Hueske
The monitoring REST interface provides detailed stats about a job, its tasks, and processing verticies including their start and end time [1]. However, it is not trivial to make sense of the execution times because Flink uses pipelined shuffles by default. That means that the execution of multiple

Re: How to perform efficient DataSet reuse between iterations

2017-11-28 Thread Miguel Coimbra
Hello, You're right, I was overlooking that. With your suggestion, I now just define a different sink in each iteration of the loop. Then they all output to disk when executing a single bigger plan. I have one more question: I know I can retrieve the total time this single job takes to execute, b

Re: How to perform efficient DataSet reuse between iterations

2017-11-28 Thread Fabian Hueske
Hi, by calling result.count(), you compute the complete plan from the beginning and not just the operations you added since the last execution. Looking at the output you posted, each step takes about 15 seconds (with about 5 secs of initialization). So the 20 seconds of the first step include init

Re: How to perform efficient DataSet reuse between iterations

2017-11-28 Thread Miguel Coimbra
Hello Fabian, Thank you for the reply. I was hoping the situation had in fact changed. As far as I know, I am not calling execute() directly even once - it is being called implicitly by simple DataSink elements added to the plan through count(): System.out.println(String.format("%d-th graph algo

Re: How to perform efficient DataSet reuse between iterations

2017-11-27 Thread Fabian Hueske
Hi Miguel, I'm sorry but AFAIK, the situation has not changed. Is it possible that you are calling execute() multiple times? In that case, the 1-st and 2-nd graph would be recomputed before the 3-rd graph is computed. That would explain the increasing execution time of 15 seconds. Best, Fabian

How to perform efficient DataSet reuse between iterations

2017-11-26 Thread Miguel Coimbra
Hello, I'm facing a problem in an algorithm where I would like to constantly update a DataSet representing a graph, perform some computation, output one or more DataSink (such as a file on the local system) and then reuse the DataSet for a next iteration. ​I want to avoid spilling the results to d