Re: Random Shuffling

2015-06-24 Thread Maximilian Alber
That's not the point. In Machine Learning one often divides a data set X into, e.g., three sets: one for training, one for validation, one for final testing. The sets are usually created randomly according to some ratio. Thus it would be important to keep the ratio and to do the whole

Re: Best way to write data to HDFS by Flink

2015-06-24 Thread Stephan Ewen
Hi Hawin! If you are creating code for such an output into different files/partitions, it would be amazing if you could contribute this code to Flink. It seems like a very common use case, so this functionality will be useful to other users as well! Greetings, Stephan On Tue, Jun 23, 2015 at

Re: Random Shuffling

2015-06-24 Thread Sebastian
A very simple way to achieve this is to generate a random variate on the driver that describes a mapping of datapoints to samples. Then you simply join the dataset with this mapping to generate the samples. This approach requires you to know the size of the dataset in advance, but has the
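
The mapping-then-join idea described in this message can be sketched in plain Python. This is only an illustration under stated assumptions, not Flink code and not necessarily what the author had in mind: the function names `make_split_mapping` and `split_dataset`, the `seed` parameter, and the split names are all hypothetical, and the "join" is reduced to an index lookup on a single machine. As in the message, the size of the dataset must be known in advance.

```python
import random


def make_split_mapping(n, ratios, seed=0):
    """Build a random mapping from index (0..n-1) to a split name.

    The number of indices per split is fixed up front, so the requested
    ratios are kept exactly (up to rounding) rather than only on average.
    """
    names = list(ratios)
    counts = [int(n * ratios[name]) for name in names]
    # Assumption: any remainder left over by rounding goes to the first split.
    counts[0] += n - sum(counts)
    labels = [name for name, count in zip(names, counts) for _ in range(count)]
    # Shuffle the labels, not the data: this is the "random variate"
    # generated on the driver.
    random.Random(seed).shuffle(labels)
    return labels


def split_dataset(data, ratios, seed=0):
    """Join each data point with its label from the mapping, by index."""
    mapping = make_split_mapping(len(data), ratios, seed)
    splits = {name: [] for name in ratios}
    for index, point in enumerate(data):
        splits[mapping[index]].append(point)
    return splits


if __name__ == "__main__":
    ratios = {"train": 0.6, "validation": 0.2, "test": 0.2}
    result = split_dataset(list(range(10)), ratios, seed=42)
    for name, points in result.items():
        print(name, len(points))
```

In a distributed setting the same mapping would be joined with the dataset on a record index instead of being looked up in a local list; how to assign such an index is outside the scope of this sketch.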

Re: Connecting the channel failed: Connection refused

2015-06-24 Thread Ufuk Celebi
Hey Aaron, thanks for preparing the example. I've checked it out and tried it with a similar setup (12 task managers with 1 slot each, running the job with parallelism of 12). I couldn't reproduce the problem. What have you configured in the slaves file? I think Flink does not allow you to

Re: Documentation Error

2015-06-24 Thread Maximilian Michels
Hi Max, Thanks for noticing! Fixed on the master and for the 0.9.1 release. Cheers, Max On Tue, Jun 23, 2015 at 5:09 PM, Maximilian Alber alber.maximil...@gmail.com wrote: Hi Flinksters, just some minor: http://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html in the

[ANNOUNCE] Apache Flink 0.9.0 released

2015-06-24 Thread Maximilian Michels
The Apache Flink community is pleased to announce the availability of the 0.9.0 release. Apache Flink is an open source platform for scalable batch and stream data processing. Flink's core consists of a streaming dataflow engine that provides data distribution, communication, and fault tolerance

Re: Connecting the channel failed: Connection refused

2015-06-24 Thread Aaron Jackson
Thanks. My setup is actually 3 task managers x 4 slots. I played with the parallelism and found that at low values, the error did not occur. I can only conclude that there is some form of data shuffling that is occurring that is sensitive to the data source. Yes, seems a little odd to me as

Pergem exception from web-client

2015-06-24 Thread Flavio Pompermaier
Hi to all, I'm facing an OutOfMemoryError: PermGen space when running my job multiple times from the web client interface. Where do I need to increase it? The full stacktrace is: org.apache.flink.client.program.ProgramInvocationException: The program's entry point class '' caused an exception

using value that are not passed as parameter

2015-06-24 Thread Michele Bertoni
Hi everybody, this question may sound stupid, but I would like to be clear about what happens if inside a dataset transformation (e.g. a map) I use something that is declared somewhere else, like a variable or a dataset, and not passed as broadcast dataset nor parameter in the constructor of a

Re: Pergem exception from web-client

2015-06-24 Thread Flavio Pompermaier
ok thanks Matthias On 24 Jun 2015 21:00, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: Hi, you need to increase JVM parameter -XX:MaxPermSize= The default value should be something like 64m Just add the flag to variable JVM_ARGS in bin/webclient.sh (line 33). - Compare

Re: Connecting the channel failed: Connection refused

2015-06-24 Thread Stephan Ewen
Aaron, Can you check how the TaskManagers register at the JobManager? When you look at the 'TaskManagers' section in the JobManager's web interface (at port 8081), what does it say as the TaskManager host names? Does it list host1, host2, host3...? Thanks, Stephan Am 24.06.2015 20:31 schrieb