Re: Does 'DataStream.writeAsCsv' suppose to work like this?

2015-11-18 Thread Maximilian Michels
Yes, that does make sense! Thank you for explaining. Have you made the change yet? I couldn't find it on the master. On Wed, Nov 18, 2015 at 5:16 PM, Stephan Ewen wrote: > That makes sense... > > On Mon, Oct 26, 2015 at 12:31 PM, Márton Balassi >

RE: YARN High Availability

2015-11-18 Thread Gwenhael Pasquiers
Never mind. Looking at the logs I saw that it was having issues trying to connect to ZK. To make it short, it had the wrong port. It is now starting. Tomorrow I’ll try to kill some JobManagers *evil*. Another question: if I have multiple HA Flink jobs, are there some points to check in order to

Fold vs Reduce in DataStream API

2015-11-18 Thread Ron Crocker
Is there a succinct description of the distinction between these transforms? Ron — Ron Crocker Principal Engineer & Architect ( ( •)) New Relic rcroc...@newrelic.com M: +1 630 363 8835

Re: Fold vs Reduce in DataStream API

2015-11-18 Thread Fabian Hueske
Hi Ron, Have you checked: https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html#transformations ? Fold is like reduce, except that you define a start element (of a different type than the input type) and the result type is the type of the initial value. In
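
A quick illustration of the distinction Fabian describes, in plain Java (not Flink's DataStream API, but the semantics carry over): reduce combines elements pairwise and keeps the input type, while fold starts from an initial value whose type determines the result type. Class and method names here are invented for the sketch.

```java
import java.util.List;

public class FoldVsReduce {
    static int reduceSum(List<Integer> nums) {
        // reduce: pairwise combination, result type == input type (Integer)
        return nums.stream().reduce(0, Integer::sum);
    }

    static String foldJoin(List<Integer> nums) {
        // fold: the initial value "start" is a String, so the result
        // type is String even though the inputs are Integers
        return nums.stream().reduce("start",
                (acc, n) -> acc + ":" + n,  // accumulate into the String
                (a, b) -> a + b);           // combiner (used when parallel)
    }

    public static void main(String[] args) {
        System.out.println(reduceSum(List.of(1, 2, 3, 4))); // 10
        System.out.println(foldJoin(List.of(1, 2, 3, 4)));  // start:1:2:3:4
    }
}
```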

Re: Published test artifacts for flink streaming

2015-11-18 Thread Stephan Ewen
Okay, I think I misunderstood your problem. Usually you can simply execute tests one after another by waiting until "env.execute()" returns. The streaming jobs terminate by themselves once the sources reach end of stream (finite streams are supported that way) but make sure all records flow

Re: Published test artifacts for flink streaming

2015-11-18 Thread Nick Dimiduk
Sorry Stephan, but I don't follow how global order applies in my case. I'm merely checking the size of the sink results. I assume all tuples from a given test invocation have sunk before the next test begins, which is clearly not the case. Is there a way I can place a barrier in my tests to ensure

Re: Reading null value from datasets

2015-11-18 Thread Stephan Ewen
Hi Guido! If you use Scala, I would use an Option to represent nullable fields. That is a very explicit solution that marks which fields can be null, and also forces the program to handle this carefully. We are looking to add support for Java 8's Optional type as well for exactly that purpose.
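
The same idea sketched with Java 8's Optional, which the thread mentions as planned support (the Record class and field names here are invented for illustration): the type system forces every consumer to handle the missing case explicitly instead of risking a NullPointerException.

```java
import java.util.List;
import java.util.Optional;

public class NullableFields {
    // a field that may legitimately be absent is declared Optional
    record Row(int id, Optional<String> comment) {}

    static String render(Row r) {
        // the caller must decide what a missing value means
        return r.comment().orElse("<missing>");
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row(1, Optional.of("ok")),
                new Row(2, Optional.empty()));
        rows.forEach(r -> System.out.println(render(r)));
    }
}
```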

Re: Checkpoints in batch processing & JDBC Output Format

2015-11-18 Thread Stephan Ewen
Hi! If you go with the Batch API, then any failed task (like a sink trying to insert into the database) will be completely re-executed. That makes sure no data is lost in any way, no extra effort needed. It may insert a lot of duplicates, though, if the task is re-started after half the data was
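
A minimal sketch of why re-execution duplicates plain INSERTs and why keyed writes are safe: the Map below stands in for a JDBC table with a primary key, so re-running the "sink" overwrites rather than duplicates. This is an illustration of the idempotence idea, not Flink or JDBC code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IdempotentSink {
    // stand-in for a database table: primary key -> value
    static Map<Integer, String> table = new HashMap<>();

    static void upsert(List<Map.Entry<Integer, String>> rows) {
        for (var row : rows) {
            table.put(row.getKey(), row.getValue()); // overwrite by key
        }
    }

    public static void main(String[] args) {
        var rows = List.of(Map.entry(1, "a"), Map.entry(2, "b"));
        upsert(rows); // first (possibly partial) attempt
        upsert(rows); // full re-execution after a failure
        System.out.println(table.size()); // still 2: re-running is harmless
    }
}
```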

Re: Does 'DataStream.writeAsCsv' suppose to work like this?

2015-11-18 Thread Stephan Ewen
That makes sense... On Mon, Oct 26, 2015 at 12:31 PM, Márton Balassi wrote: > Hey Max, > > The solution I am proposing is not flushing on every record, but it makes > sure to forward the flushing from the sinkfunction to the outputformat > whenever it is triggered.
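
A sketch of the behavior Márton describes: the sink function does not flush per record, but forwards flush() to its output format whenever flushing is triggered. The class names below are simplified stand-ins, not Flink's actual sink or OutputFormat API.

```java
import java.util.ArrayList;
import java.util.List;

public class FlushForwarding {
    static class BufferingFormat {
        List<String> buffer = new ArrayList<>();
        List<String> written = new ArrayList<>();
        void write(String r) { buffer.add(r); }  // buffer only, no I/O yet
        void flush() { written.addAll(buffer); buffer.clear(); }
    }

    static class Sink {
        final BufferingFormat format;
        Sink(BufferingFormat format) { this.format = format; }
        void invoke(String record) { format.write(record); } // no flush here
        void flush() { format.flush(); }  // forwarded when triggered
    }

    public static void main(String[] args) {
        BufferingFormat format = new BufferingFormat();
        Sink sink = new Sink(format);
        sink.invoke("a");
        sink.invoke("b");   // nothing written yet
        sink.flush();       // e.g. triggered periodically, not per record
        System.out.println(format.written); // [a, b]
    }
}
```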

Re: Parallel file read in LocalEnvironment

2015-11-18 Thread Stephan Ewen
The JobManager does not read all files, but it has to query the HDFS for all file metadata (size, blocks, block locations), which can take a bit. There is a separate call to the HDFS NameNode for each file. The more files, the more metadata has to be collected. On Wed, Nov 18, 2015 at 7:15 PM,

Re: Parallel file read in LocalEnvironment

2015-11-18 Thread Stephan Ewen
Okay, let me take a step back and make sure I understand this right... With many small files it takes longer to start the job, correct? How much time did it actually take and how many files did you have? On Wed, Nov 18, 2015 at 7:18 PM, Flavio Pompermaier wrote: > in my

Re: Parallel file read in LocalEnvironment

2015-11-18 Thread Flavio Pompermaier
So why does it take so long to start the job? Because in any case the JobManager has to read all the lines of the input files before generating the splits? On 18 Nov 2015 17:52, "Stephan Ewen" wrote: > Late answer, sorry: > > The splits are created in the JobManager, so the sub

Re: Parallel file read in LocalEnvironment

2015-11-18 Thread Flavio Pompermaier
In my test I was using the local fs (ext4). On 18 Nov 2015 19:17, "Stephan Ewen" wrote: > The JobManager does not read all files, but it has to query the HDFS for > all file metadata (size, blocks, block locations), which can take a bit. > There is a separate call to the HDFS

Re: Published test artifacts for flink streaming

2015-11-18 Thread Nick Dimiduk
Please see the above gist: my test makes no assertions until after the env.execute() call. Adding setParallelism(1) to my sink appears to stabilize my test. Indeed, very helpful. Thanks a lot! -n On Wed, Nov 18, 2015 at 9:15 AM, Stephan Ewen wrote: > Okay, I think I

Re: FlinkKafkaConsumer and multiple topics

2015-11-18 Thread Stephan Ewen
The new KafkaConsumer for Kafka 0.9 should be able to handle this, as the Kafka client code itself has support for this then. For 0.8.x, we would need to implement support for recovery inside the consumer ourselves, which is why we decided to initially let the job recovery take care of that. If

Re: Parallel file read in LocalEnvironment

2015-11-18 Thread Flavio Pompermaier
It was long ago... but if I remember correctly there were about 50k. On 18 Nov 2015 19:22, "Stephan Ewen" wrote: > Okay, let me take a step back and make sure I understand this right... > > With many small files it takes longer to start the job, correct? How much > time did it

Re: Fold vs Reduce in DataStream API

2015-11-18 Thread Kostas Tzoumas
Granted, both are presented with the same example in the docs. They are modeled after reduce and fold in functional programming. Perhaps we should have a bit more enlightening examples. On Wed, Nov 18, 2015 at 6:39 PM, Fabian Hueske wrote: > Hi Ron, > > Have you checked: >

Compiler Exception

2015-11-18 Thread Truong Duc Kien
Hi, I'm hitting a CompilerException with some of my data sets, but not all of them. Exception in thread "main" org.apache.flink.optimizer.CompilerException: No plan meeting the requirements could be created @ Bulk Iteration (Bulk Iteration) (1:null). Most likely reason: Too restrictive plan hints.

Re: Creating a representative streaming workload

2015-11-18 Thread Robert Metzger
Hey Vasia, I think a very common workload would be an event stream from web servers of an online shop. Usually, these shops have multiple servers, so events arrive out of order. I think there are plenty of different use cases that you can build around that data: - Users perform different actions

Re: Session Based Windows

2015-11-18 Thread Aljoscha Krettek
Hi Konstantin, you are right, if the stream is keyed by the session-id then it works. I was referring to the case where you have, for example, some interactions with timestamps and you want to derive the sessions from them. In that case, it can happen that events that should belong to one
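
An illustration of the session-derivation case Aljoscha refers to, in plain Java rather than Flink's windowing API: given timestamped interactions (possibly out of order), split them into sessions whenever the inactivity gap is exceeded. The method name and gap value are invented for the sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SessionSplit {
    static List<List<Long>> sessions(List<Long> timestamps, long gap) {
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted); // order by event time first
        List<List<Long>> result = new ArrayList<>();
        for (long t : sorted) {
            if (!result.isEmpty()) {
                List<Long> last = result.get(result.size() - 1);
                if (t - last.get(last.size() - 1) <= gap) {
                    last.add(t); // still within the current session
                    continue;
                }
            }
            List<Long> fresh = new ArrayList<>();
            fresh.add(t);
            result.add(fresh); // inactivity gap exceeded: new session
        }
        return result;
    }

    public static void main(String[] args) {
        // events at 1, 2, 3 form one session; 10, 11 start a new one (gap = 3)
        System.out.println(sessions(List.of(1L, 2L, 10L, 3L, 11L), 3L));
        // [[1, 2, 3], [10, 11]]
    }
}
```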

Re: Session Based Windows

2015-11-18 Thread Konstantin Knauf
Hi Aljoscha, thanks, that's what I thought. Just wanted to verify that keyBy + SessionWindow() works with intermingled events. Cheers, Konstantin On 18.11.2015 11:14, Aljoscha Krettek wrote: > Hi Konstantin, > you are right, if the stream is keyed by the session-id then it works. > > I was

Re: Session Based Windows

2015-11-18 Thread Vladimir Stoyak
We were also trying to address session windowing but took a slightly different approach as to which window we place the event into. We did not want the "triggering event" to be purged as part of the window it triggered, but instead to create a new window for it and have the old window fire and

Re: finite subset of an infinite data stream

2015-11-18 Thread Aljoscha Krettek
Hi, I wrote a little example that could be what you are looking for: https://github.com/dataArtisans/query-window-example It basically implements a window operator with a modifiable window size that also allows querying the current accumulated window contents using a second input stream.

Re: Published test artifacts for flink streaming

2015-11-18 Thread Stephan Ewen
There is no global order in parallel streams, it is something that applications need to work with. We are thinking about adding operations to introduce event-time order (at the cost of some delay), but that is only plans at this point. What I do in my tests is run the test streams in parallel,

Re: Specially introduced Flink to chinese users in CNCC(China National Computer Congress)

2015-11-18 Thread Welly Tambunan
Agreed, and the stateful streaming operator instances in Flink look natural compared to Apache Spark. On Thu, Nov 19, 2015 at 10:54 AM, Liang Chen wrote: > Two aspects are attracting them: > 1. Flink is using Java, it is easy for most of them to start Flink, and be > more