Python vs Scala - Performance

2015-06-29 Thread Maximilian Alber
Hi Flinksters, we recently had a discussion in our working group about which language we should use with Flink. To bring it to the point: most people would like to use Python because they are familiar with it and there is a nice scientific stack to, e.g., print and analyse the results. But our concern is

Re: FileSystem exists

2015-06-29 Thread Stephan Ewen
You can also do myDir.getFileSystem().exists(myDir), but I don't think there is a shorter way... On Mon, Jun 29, 2015 at 12:39 PM, Flavio Pompermaier pomperma...@okkam.it wrote: Hi to all, in my job I have to check if a directory exists and currently I have to write: Path myDir = new
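A minimal sketch of the check described above, using Flink's FileSystem API (the directory name is hypothetical):

    import java.io.IOException;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.Path;

    public class DirCheck {
        public static void main(String[] args) throws IOException {
            Path myDir = new Path("hdfs:///tmp/my-output");  // hypothetical directory
            FileSystem fs = myDir.getFileSystem();           // file system resolved from the path's scheme
            if (fs.exists(myDir)) {
                System.out.println("directory already exists");
            }
        }
    }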

Re: Logs meaning states

2015-06-29 Thread Stephan Ewen
It is not quite easy to understand what you are trying to do. Can you post your program here? Then we can take a look and give you a good answer... On Mon, Jun 29, 2015 at 3:47 PM, Juan Fumero juan.jose.fumero.alfo...@oracle.com wrote: Is there any other way to apply the function in parallel

Re: Logs meaning states

2015-06-29 Thread Juan Fumero
Hi Stephan, so should I use another method instead of collect? It seems multithreading is not working with this. Juan On Mon, 2015-06-29 at 14:51 +0200, Stephan Ewen wrote: Hi Juan! This is an artifact of a workaround right now. The actual collect() logic happens in the flatMap() and

Re: Logs meaning states

2015-06-29 Thread Stephan Ewen
Hi Juan! This is an artifact of a workaround right now. The actual collect() logic happens in the flatMap() and the sink is a dummy that executes nothing. The flatMap writes the data to be collected to the accumulator that delivers it back. Greetings, Stephan On Mon, Jun 29, 2015 at 2:30 PM,

Re: Logs meaning states

2015-06-29 Thread Stephan Ewen
In general, avoid collect if you can. Collect brings data to the client, where the computation is not parallel any more. Try to do as much on the DataSet as possible. On Mon, Jun 29, 2015 at 2:58 PM, Juan Fumero juan.jose.fumero.alfo...@oracle.com wrote: Hi Stephan, so should I use
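A hedged sketch of "do as much on the DataSet as possible": run the heavy work inside an operator and write the result with a parallel sink instead of collecting it at the client (the mapper and the output path are hypothetical):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<Integer> input = env.fromElements(1, 2, 3, 4);
    DataSet<Integer> output = input.map(new MapFunction<Integer, Integer>() {
        @Override
        public Integer map(Integer value) {
            return value * value;                    // stand-in for the expensive computation
        }
    });
    output.writeAsText("hdfs:///tmp/results");       // parallel sink, hypothetical path
    env.execute();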

Re: Logs meaning states

2015-06-29 Thread Juan Fumero
Is there any other way to apply the function in parallel and return the result to the client in parallel? Thanks Juan On Mon, 2015-06-29 at 15:01 +0200, Stephan Ewen wrote: In general, avoid collect if you can. Collect brings data to the client, where the computation is not parallel any

Re: Logs meaning states

2015-06-29 Thread Juan Fumero
Yeah, sorry. I would like to do something simple like this, but using Java Threads. DataSet<Tuple2<Integer, Integer>> input = env.fromCollection(in); DataSet<Integer> output = input.map(new HighWorkLoad()); ArrayList<Integer> result = output.consume(); // ? like collect but in parallel, some operation
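As far as I know there is no consume() on DataSet; the closest existing call is collect(), and the point made earlier in the thread is that the map itself still runs in parallel, only the final gathering of results at the client is sequential. A minimal sketch (HighWorkLoad is the hypothetical mapper from the message above):

    DataSet<Tuple2<Integer, Integer>> input = env.fromCollection(in);
    DataSet<Integer> output = input.map(new HighWorkLoad());
    List<Integer> result = output.collect();   // map runs in parallel; only the gathering happens at the client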

Re: ArrayIndexOutOfBoundsException when running job from JAR

2015-06-29 Thread Robert Metzger
@Stephan: I don't think there is a way to deal with this. In my understanding, the (main) purpose of the user@ list is not to report Flink bugs. It is a forum for users to help each other. Flink committers happen to know a lot about the system, so it's easy for them to help users. Also, it's a good

JobManager is no longer reachable

2015-06-29 Thread Flavio Pompermaier
Hi to all, I'm restarting the discussion about a problem I already discussed on this mailing list (but that started with a different subject). I'm running Flink 0.9.0 on CDH 5.1.3 so I compiled the sources as: mvn clean install -Dhadoop.version=2.3.0-cdh5.1.3 -Dhbase.version=0.98.1-cdh5.1.3

Logs meaning states

2015-06-29 Thread Juan Fumero
Hi, I am starting with Flink. I have tried to look for the documentation but I haven't found it clear. I wonder about the difference between these two states: FlatMap RUNNING vs DataSink RUNNING. Is FlatMap doing any data transformation? Compilation? At which point is it actually executing the

Re: JobManager is no longer reachable

2015-06-29 Thread Stephan Ewen
Hi Flavio! I had a look at the logs. There seems nothing suspicious - at some point, the TaskManager and JobManager declare each other unreachable. A pretty common cause for that is that the JVMs stall for a long time due to garbage collection. The JobManager cannot see the difference between a

Re: JobManager is no longer reachable

2015-06-29 Thread Flavio Pompermaier
I think that actually there's an exception thrown within the code that I suspect is not reported anywhere... could it be? On Mon, Jun 29, 2015 at 3:28 PM, Flavio Pompermaier pomperma...@okkam.it wrote: Which file and which JVM options do I have to modify to try options 1 and 3..? 1. Don't

Re: cogroup

2015-06-29 Thread Michele Bertoni
Thanks both for answering, that's what I expected. I was using join at first, but sadly I had to move from join to cogroup because I need an outer join. The alternative to the cogroup is to “complete” the inner join, extracting from the original dataset what did not match in the cogroup by

Re: cogroup

2015-06-29 Thread Michele Bertoni
OK thanks! Then for now I will use it until the true outer join is ready. On 29 Jun 2015, at 18:22, Fabian Hueske fhue...@gmail.com wrote: Yes, if you need outer join semantics you have to go with CoGroup. Some members of the Flink community are working on
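Until a built-in outer join is available, a left outer join can be expressed with coGroup along the lines sketched below; left and right are hypothetical DataSet<Tuple2<Long, String>> inputs, and the key positions and merge logic are likewise assumptions:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.flink.api.common.functions.CoGroupFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.util.Collector;

    DataSet<Tuple2<Long, String>> joined = left.coGroup(right)
        .where(0).equalTo(0)
        .with(new CoGroupFunction<Tuple2<Long, String>, Tuple2<Long, String>, Tuple2<Long, String>>() {
            @Override
            public void coGroup(Iterable<Tuple2<Long, String>> lhs,
                                Iterable<Tuple2<Long, String>> rhs,
                                Collector<Tuple2<Long, String>> out) {
                // the co-group iterables are only traversable once, so buffer the right side
                List<Tuple2<Long, String>> rightSide = new ArrayList<>();
                for (Tuple2<Long, String> r : rhs) {
                    rightSide.add(r);
                }
                for (Tuple2<Long, String> l : lhs) {
                    if (rightSide.isEmpty()) {
                        out.collect(l);                                         // no match: keep the left element
                    } else {
                        for (Tuple2<Long, String> r : rightSide) {
                            out.collect(new Tuple2<>(l.f0, l.f1 + "," + r.f1)); // matched: emit the joined pair
                        }
                    }
                }
            }
        });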

Re: JobManager is no longer reachable

2015-06-29 Thread Stephan Ewen
Hi Flavio! Can you post the JobManager's log here? It should have the message about what is going wrong... Stephan On Mon, Jun 29, 2015 at 11:43 AM, Flavio Pompermaier pomperma...@okkam.it wrote: Hi to all, I'm restarting the discussion about a problem I already discussed on this mailing

Re: The slot in which the task was scheduled has been killed (probably loss of TaskManager)

2015-06-29 Thread Andra Lungu
Something similar in flink-0.10-SNAPSHOT: 06/29/2015 10:33:46 CHAIN Join(Join at main(TriangleCount.java:79)) - Combine (Reduce at main(TriangleCount.java:79))(222/224) switched to FAILED java.lang.Exception: The slot in which the task was executed has been released. Probably loss of

Re: Best way to write data to HDFS by Flink

2015-06-29 Thread Stephan Ewen
Hi Hawin! The performance tuning of Kafka is much trickier than that of Flink. Your performance bottleneck may be Kafka at this point, not Flink. To make Kafka fast, make sure you have the right setup for the data directories and that ZooKeeper is set up properly (for good throughput). To test the

Re: ArrayIndexOutOfBoundsException when running job from JAR

2015-06-29 Thread Vasiliki Kalavri
Thank you for the answer Robert! I realize it's a single JVM running, yet I would expect programs to behave in the same way, i.e. serialization to happen (even if not necessary), in order to catch this kind of bug before cluster deployment. Is this simply not possible or is it a design choice we

Execution graph

2015-06-29 Thread Michele Bertoni
Hi, I was trying to run my program in the Flink web environment (the local one). When I run it I get the graph of the planned execution, but in each node there is a parallelism = 1; instead I think it runs with parallelism = 8 (8 cores, I always get 8 outputs). What does that mean? Is that wrong or is it
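One way to make the displayed plan and the actual run agree is to set the parallelism explicitly on the execution environment; a hedged sketch (method name as in Flink 0.9, earlier releases call it setDegreeOfParallelism):

    import org.apache.flink.api.java.ExecutionEnvironment;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(8);   // pin the job parallelism so the planned and the actual value match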