How to create a stream of data batches

2015-09-04 Thread Andres R. Masegosa
Hi, I'm trying to code some machine learning algorithms on top of Flink, such as variational Bayes learning algorithms. Instead of working at the data element level (i.e. using map transformations), it would be far more efficient to work at the "batch of elements" level (i.e. I get a batch of

Re: Usage of Hadoop 2.2.0

2015-09-04 Thread Matthias J. Sax
+1 for dropping. On 09/04/2015 11:04 AM, Maximilian Michels wrote: > +1 for dropping Hadoop 2.2.0 binary and source-compatibility. The > release is hardly used and complicates the important high-availability > changes in Flink. > > On Fri, Sep 4, 2015 at 9:33 AM, Stephan Ewen

Re: How to create a stream of data batches

2015-09-04 Thread Stephan Ewen
Interesting question; you are the second to ask that. Batching in user code is one way, as Matthias said. We have on the roadmap a way to transform a stream into a set of batches, but it will be a while until this is in. See

Re: Question on flink and hdfs

2015-09-04 Thread Maximilian Michels
Hi Jerry, If you don't want to use Hadoop, simply pick _any_ Flink version. We recommend the Hadoop 1 version because it has the fewest dependencies, i.e. you need to download less and the installation occupies less space. Other than that, it doesn't really matter if you don't use the HDFS

Re: How to create a stream of data batches

2015-09-04 Thread Matthias J. Sax
Hi Andres, you could do this by using your own data type, for example: > public class MyBatch<T> { > private ArrayList<T> data = new ArrayList<T>(); > } In the DataSource, you need to specify your own InputFormat that reads multiple tuples into a batch and emits the whole batch at once. However, be
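
A minimal sketch of this user-code batching approach (MyBatch, Batcher, the element type T, and the batch size are illustrative assumptions, not code from the thread):

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.util.Collector;

// Hypothetical batch wrapper: groups several elements so downstream
// operators can process a whole batch at once instead of element by element.
public class MyBatch<T> implements Serializable {
    private final List<T> data = new ArrayList<T>();

    public void add(T element) { data.add(element); }
    public List<T> getData() { return data; }
    public int size() { return data.size(); }
}

// Groups the elements of each partition into fixed-size batches; the
// trailing partial batch is emitted as well.
class Batcher<T> implements MapPartitionFunction<T, MyBatch<T>> {
    private final int batchSize;

    Batcher(int batchSize) { this.batchSize = batchSize; }

    @Override
    public void mapPartition(Iterable<T> values, Collector<MyBatch<T>> out) {
        MyBatch<T> batch = new MyBatch<T>();
        for (T value : values) {
            batch.add(value);
            if (batch.size() == batchSize) {
                out.collect(batch);       // emit a full batch downstream
                batch = new MyBatch<T>();
            }
        }
        if (batch.size() > 0) {           // don't drop the last partial batch
            out.collect(batch);
        }
    }
}

On a DataSet this could then be applied as dataSet.mapPartition(new Batcher<MyType>(100)); because of generic type erasure, Flink may additionally need a type hint via returns(...).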

Convergence Criterion in IterativeDataSet

2015-09-04 Thread Andres R. Masegosa
Hi, I'm trying to implement some machine learning algorithms that involve several iterations until convergence (to a fixed point). My idea is to use an IterativeDataSet with an Aggregator which produces the result (i.e. a set of parameters defining the model). From the interface

Re: Usage of Hadoop 2.2.0

2015-09-04 Thread Maximilian Michels
+1 for dropping Hadoop 2.2.0 binary and source-compatibility. The release is hardly used and complicates the important high-availability changes in Flink. On Fri, Sep 4, 2015 at 9:33 AM, Stephan Ewen wrote: > I am good with that as well. Mind that we are not only dropping a

BloomFilter Exception

2015-09-04 Thread Flavio Pompermaier
Hi to all, running a job with Flink 0.10-SNAPSHOT I got the following Exception: java.lang.IllegalArgumentException: expectedEntries should be > 0 at org.apache.flink.shaded.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122) at

Re: Convergence Criterion in IterativeDataSet

2015-09-04 Thread Stephan Ewen
I think you can do this with the current interface. The convergence criterion object stays around, so you should be able to simply store the current aggregator value in a field (when the check is invoked). Any round but the first could compare against that field. On Fri, Sep 4, 2015 at 2:25 PM,
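
A minimal sketch of the pattern Stephan describes, assuming the aggregate is a DoubleValue and using an illustrative epsilon threshold (the names are not from the thread):

import org.apache.flink.api.common.aggregators.ConvergenceCriterion;
import org.apache.flink.types.DoubleValue;

// Keeps the previous round's aggregate in a field of the criterion object
// and compares against it on every round after the first.
public class FixedPointCriterion implements ConvergenceCriterion<DoubleValue> {
    private static final double EPSILON = 1e-6;  // assumed threshold
    private Double previous = null;              // null until the first check

    @Override
    public boolean isConverged(int iteration, DoubleValue value) {
        boolean converged = previous != null
                && Math.abs(value.getValue() - previous) < EPSILON;
        previous = value.getValue();  // remember this round for the next check
        return converged;
    }
}

The criterion would then be registered on the IterativeDataSet, e.g. with registerAggregationConvergenceCriterion("fixed-point", new DoubleSumAggregator(), new FixedPointCriterion()).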

Re: Convergence Criterion in IterativeDataSet

2015-09-04 Thread Sachin Goel
Hi Andres, Does something like this solve what you're trying to achieve? https://github.com/apache/flink/pull/918/files Regards, Sachin On Sep 4, 2015 6:24 PM, "Stephan Ewen" wrote: > I think you can do this with the current interface. The convergence > criterion object stays

Using Flink with Redis question

2015-09-04 Thread Jerry Peng
Hello, So I am trying to use Jedis (the Redis Java client) with the Flink streaming API, but I get an exception: org.apache.flink.client.program.ProgramInvocationException: The main method caused an error. at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:452) at

Re: Using Flink with Redis question

2015-09-04 Thread Márton Balassi
Hey Jerry, Jay is on the right track; this issue has to do with the Flink operator life cycle. When you hit execute, all your user-defined classes get serialized so that they can be shipped to the workers on the cluster. To execute some code before your FlatMapFunction starts processing the data
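
A minimal sketch of that life-cycle approach, creating the non-serializable Jedis client inside open() rather than in the driver program (host, port, and the lookup logic are illustrative assumptions):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;
import redis.clients.jedis.Jedis;

// The Jedis client is created on the worker in open(), which runs once per
// parallel task instance after deserialization and before any elements are
// processed, so the client itself is never serialized and shipped.
public class RedisLookupFlatMap extends RichFlatMapFunction<String, String> {
    private transient Jedis jedis;  // transient: excluded from serialization

    @Override
    public void open(Configuration parameters) {
        jedis = new Jedis("localhost", 6379);
    }

    @Override
    public void flatMap(String key, Collector<String> out) {
        String value = jedis.get(key);  // enrich the stream element from Redis
        if (value != null) {
            out.collect(value);
        }
    }

    @Override
    public void close() {
        if (jedis != null) {
            jedis.close();
        }
    }
}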

Fwd: Memory management issue

2015-09-04 Thread Ricarda Schueler
Hi All, We're running into a memory management issue when using the iterateWithTermination function. Using a small amount of data, everything works perfectly fine. However, as soon as the main memory is filled up on a worker, nothing seems to be happening any more. Once this happens, any worker

Re: Using Flink with Redis question

2015-09-04 Thread Jay Vyas
Maybe wrapping Jedis with a serializable class will do the trick? But in general, is there a way to reference jar classes in Flink apps without serializing them? > On Sep 4, 2015, at 1:36 PM, Jerry Peng wrote: > > Hello, > > So I am trying to use jedis (redis

Flink join with external source

2015-09-04 Thread Jerry Peng
Hello, Does Flink currently support operators that use Redis? If I'm using the streaming API in Flink and I need to look up something in a Redis database during the processing of the stream, how can I do that?
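
Following Márton's life-cycle advice above, one way is to do the lookup inside a rich function that opens its Redis connection on the worker; a hypothetical driver sketch reusing the RedisLookupFlatMap from that thread (the socket source is only a stand-in):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Wires the RedisLookupFlatMap sketched earlier into a streaming pipeline.
public class RedisEnrichmentJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)  // keys arriving as text lines
           .flatMap(new RedisLookupFlatMap())    // look each key up in Redis
           .print();

        env.execute("Redis enrichment example");
    }
}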

Re: Efficiency for Filter then Transform ( filter().map() vs flatMap() )

2015-09-04 Thread Stephan Ewen
We will definitely also try to get the chaining overhead down a bit. BTW: To reach this kind of throughput, you need sources that can produce very fast... On Fri, Sep 4, 2015 at 12:20 AM, Welly Tambunan wrote: > Hi Stephan, > > That's good information to know. We will hit

Re: Usage of Hadoop 2.2.0

2015-09-04 Thread Stephan Ewen
I am good with that as well. Mind that we are not only dropping a binary distribution for Hadoop 2.2.0, but also the source compatibility with 2.2.0. Let's also reconfigure Travis to test - Hadoop 1 - Hadoop 2.3 - Hadoop 2.4 - Hadoop 2.6 - Hadoop 2.7 On Fri, Sep 4, 2015 at 6:19 AM, Chiwan

Re: Efficiency for Filter then Transform ( filter().map() vs flatMap() )

2015-09-04 Thread Welly Tambunan
Hi Stephan, Cheers On Fri, Sep 4, 2015 at 2:31 PM, Stephan Ewen wrote: > We will definitely also try to get the chaining overhead down a bit. > > BTW: To reach this kind of throughput, you need sources that can produce > very fast... > > On Fri, Sep 4, 2015 at 12:20 AM, Welly

Re: Efficiency for Filter then Transform ( filter().map() vs flatMap() )

2015-09-04 Thread Welly Tambunan
Hi Stephan, Thanks for your clarification. Basically we will have lots of sensors that will push this kind of data to a queuing system (currently we are using RabbitMQ, but will soon move to Kafka). We will also use the same pipeline to process the historical data. I also want to minimize the

Re: Question on flink and hdfs

2015-09-04 Thread Welly Tambunan
Hi Jerry, yes, that's possible. You can download the appropriate version from https://flink.apache.org/downloads.html Cheers On Fri, Sep 4, 2015 at 1:57 AM, Jerry Peng wrote: > Hello, > > Does Flink require HDFS to run? I know you can use hdfs