Re: Flink 1.5 Yarn Connection unexpectedly closed

2018-06-21 Thread Fabian Hueske
Hi Garrett, I agree, there seems to be an issue and increasing the timeout should not be the right approach to solve it. Are you running streaming or batch jobs, i.e., do some of the tasks finish much earlier than others? I'm adding Till to this thread who's very familiar with scheduling and

Re: Control insert database with dataset

2018-06-21 Thread Fabian Hueske
Hi Dulce, This functionality is not supported by the JDBCOutputFormat. Some database systems (AFAIK, MySQL) support Upsert writes, i.e., writes that insert if the primary key is not present or update the row if the PK exists. Not sure if that would meet your requirements. If you don't want to go

Re: A question about Kryo and Window State

2018-06-21 Thread Fabian Hueske
Hi Vishal, In general, Kryo serializers are not very upgrade friendly. Serializer compatibility [1] might be right approach here, but Gordon (in CC) might know more about this. Best, Fabian [1]

Re: Passing records between two jobs

2018-06-20 Thread Fabian Hueske
Hi Avihai, Rafi pointed out the two common approaches to deal with this situation. Let me expand a bit on those. 1) Transactional producing into queues: There are two approaches to accomplish exactly-once producing into queues, 1) using a system with transactional support such as Kafka or 2)

Re: # of active session windows of a streaming job

2018-06-20 Thread Fabian Hueske
> see an incorrect value from a dashboard. > This is the biggest concern of mine at this point. > > Best, > > - Dongwon > > > On Tue, Jun 19, 2018 at 7:14 PM, Fabian Hueske wrote: > >> Hi Dongwon, >> >> Do you need the number of active

Re: DataStreamCalcRule$1802" grows beyond 64 KB when execute long sql.

2018-06-19 Thread Fabian Hueske
. Is it hard to implement? I am new to the Flink Table API & SQL. > > Best, Minglei. > > On Jun 19, 2018, at 10:36 PM, Fabian Hueske wrote: > > Hi, > > Which version are you using? We fixed a similar issue for Flink 1.5.0. > If you can't upgrade yet, you can also implement a user-def

Re: DataStreamCalcRule$1802" grows beyond 64 KB when execute long sql.

2018-06-19 Thread Fabian Hueske
Hi, Which version are you using? We fixed a similar issue for Flink 1.5.0. If you can't upgrade yet, you can also implement a user-defined function that evaluates the big CASE WHEN statement. Best, Fabian 2018-06-19 16:27 GMT+02:00 zhangminglei <18717838...@163.com>: > Hi, friends. > > When I

Re: Stream Join With Early firings

2018-06-19 Thread Fabian Hueske
oach wasn't driven by the requirements but by operational > aspects (state size), so using a concept like idle state retention time > would be a more natural fit. > > Thanks, > > Johannes > > On Mon, Jun 18, 2018 at 9:57 AM Fabian Hueske wrote: > >> Hi Johannes,

Re: # of active session windows of a streaming job

2018-06-19 Thread Fabian Hueske
tate inside a trigger. > TriggerContext only allows to interact with state that is scoped to the > window and the key of the current trigger invocation (as shown in > Trigger#TriggerContext) > > Now I've come to a conclusion that it might not be possible using > DataStream API. >

Re: Flink application does not scale as expected, please help!

2018-06-19 Thread Fabian Hueske
h > more TM in play. > > @Ovidiu question is interesting to know too. @Till do you mind to share > your thoughts? > > Thank you guys! > > -- > *From:* Ovidiu-Cristian MARCU > *Sent:* Monday, June 18, 2018 6:28 PM > *To:* Fabian Hueske >

Re: Flink application does not scale as expected, please help!

2018-06-18 Thread Fabian Hueske
perfectly 1-2-4-8-16 because all happens in same TM. When > scale to 32 the performance drop, not even in par with case of parallelism > 16. Is this something expected? Thank you. > > Regards, > Yow > > -- > *From:* Fabian Hueske > *Sent:* Mon

Re: Stream Join With Early firings

2018-06-18 Thread Fabian Hueske
Hi Johannes, EventTimeSessionWindows [1] use the EventTimeTrigger [2] as default trigger (see EventTimeSessionWindows.getDefaultTrigger()). I would take the EventTimeTrigger and extend it with early firing functionality. However, there are a few things to consider: * you need to be aware that
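A minimal sketch of such an early-firing trigger against the Flink 1.4/1.5 Trigger API, assuming a fixed processing-time firing interval (class name and interval are illustrative; clean-up of the pending processing-time timer is simplified):

    import org.apache.flink.streaming.api.windowing.triggers.Trigger;
    import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

    // Emits early results every `interval` ms of processing time and the final
    // result when the watermark passes the end of the (merged) session window.
    public class EarlyFiringEventTimeTrigger extends Trigger<Object, TimeWindow> {

        private final long interval;  // early-firing interval in ms (illustrative)

        public EarlyFiringEventTimeTrigger(long interval) {
            this.interval = interval;
        }

        @Override
        public TriggerResult onElement(Object element, long timestamp, TimeWindow window, TriggerContext ctx) throws Exception {
            ctx.registerEventTimeTimer(window.maxTimestamp());                          // final firing
            ctx.registerProcessingTimeTimer(ctx.getCurrentProcessingTime() + interval); // early firing
            return TriggerResult.CONTINUE;
        }

        @Override
        public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
            ctx.registerProcessingTimeTimer(ctx.getCurrentProcessingTime() + interval);
            return TriggerResult.FIRE;  // emit an early result, keep the window content
        }

        @Override
        public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
            return time == window.maxTimestamp() ? TriggerResult.FIRE : TriggerResult.CONTINUE;
        }

        @Override
        public boolean canMerge() {
            return true;  // required for merging session windows
        }

        @Override
        public void onMerge(TimeWindow window, OnMergeContext ctx) {
            ctx.registerEventTimeTimer(window.maxTimestamp());
        }

        @Override
        public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
            ctx.deleteEventTimeTimer(window.maxTimestamp());
        }
    }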

Re: Flink application does not scale as expected, please help!

2018-06-18 Thread Fabian Hueske
Hi, Which Flink version are you using? Did you try to analyze the bottleneck of the application, i.e., is it CPU, disk IO, or network bound? Regarding the task scheduling. AFAIK, before version 1.5.0, Flink tried to schedule tasks on the same machine to reduce the amount of network transfer.

Re: Restore state from save point with add new flink sql

2018-06-15 Thread Fabian Hueske
Hi, At the moment (Flink 1.5.0), the operator UIDs depend on the overall application and not only on the query. Hence, changing the application by adding another query might change the existing UIDs. In general, you can only expect savepoint restarts to work if you don't change the application

[ANNOUNCE] Registration for Flink Forward Berlin is open

2018-06-14 Thread Fabian Hueske
Hi everyone, *Flink Forward Berlin 2018 will take place from September 3rd to 5th.* The conference will start with one day of training and continue with two days of keynotes and talks. *The registration for Flink Forward Berlin 2018 is now open!* A limited amount of early-bird passes is

Re: IoT Use Case, Problem and Thoughts

2018-06-14 Thread Fabian Hueske
s caused by the fact > that Memory states are large as it is throwing error states are larger than > certain size. So solution of (1) will possibly solve (2) as well. > > Thanks again, > > Ashish > > > On Jun 7, 2018, at 4:25 PM, Fabian Hueske wrote: > > H

Re: How to submit two Flink jobs to the same Flink cluster?

2018-06-12 Thread Fabian Hueske
Hi Angelica, The Flink cluster needs to provide a sufficient number of slots to process the tasks of all submitted jobs. Besides that there is no limit. However, if you run super many jobs, you might need to tune a few configuration parameters. Best, Fabian 2018-06-12 8:46 GMT+02:00 Sampath

Re: Flink multiple windows

2018-06-10 Thread Fabian Hueske
Hi Antonio, Cascading window aggregations as done in your example is a good idea and is preferable if the aggregation function is combinable, which is true for sum (count can be done as sum of 1s). Best, Fabian 2018-06-09 4:00 GMT+02:00 antonio saldivar : > Hello > > Has anyone work this way?
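A minimal sketch of such a cascade, assuming the input is already a DataStream of Tuple2<key, 1L> records (field positions and window sizes are illustrative):

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.time.Time;

    // events: DataStream<Tuple2<String, Long>> with (key, 1L) per record (assumed)
    DataStream<Tuple2<String, Long>> oneMinuteCounts = events
        .keyBy(0)
        .timeWindow(Time.minutes(1))
        .sum(1);

    // the hourly counts are computed by summing the 1-minute partial counts
    DataStream<Tuple2<String, Long>> oneHourCounts = oneMinuteCounts
        .keyBy(0)
        .timeWindow(Time.hours(1))
        .sum(1);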

Re: Late data before window end is even close

2018-06-08 Thread Fabian Hueske
ctual timestamps of their input data. For me it was helpful to make > this change in my Flink job: for late data output, include both processing > time (DateTime.now()) along with the event time (original timestamp). > > On Mon, May 14, 2018 at 12:42 PM, Fabian Hueske wrote: > >

Re: IoT Use Case, Problem and Thoughts

2018-06-07 Thread Fabian Hueske
Hi Ashish, Thanks for the great write up. If I understood you correctly, there are two different issues that are caused by the disabled checkpointing. 1) Recovery from a failure without restarting all operators to preserve the state in the running tasks 2) Planned restarts an application without

Re: Implementing a “join” between a DataStream and a “set of rules”

2018-06-06 Thread Fabian Hueske
Hi Turar, Managed state is a general concept in Flink's DataStream API and not specifically designed for windows (although windows use it internally). I'd recommend the broadcast state that Aljoscha proposed. It was specifically designed for these use cases. It is true that the state is currently

Re: TaskManager use more memory than Xmx

2018-06-06 Thread Fabian Hueske
Hi, Flink uses a few libraries that allocate direct (off-heap) memory (Netty, RocksDB). Flink can also allocate direct memory by itself (only relevant for batch setups though). Therefore, Xmx controls only one part of Flink's memory footprint. Best, Fabian 2018-06-04 16:48 GMT+02:00 aitozi : >

Re: why does flink release package preferred uber jar than small jar?

2018-06-06 Thread Fabian Hueske
One reason is that we shade away several dependencies to avoid version conflicts with user dependencies or dependencies of internal dependencies. Best, Fabian 2018-06-05 4:07 GMT+02:00 makeyang : > thanks rongrong, but it seems irrelevant. > > > > -- > Sent from:

Re: NPE in flink sql over-window

2018-06-05 Thread Fabian Hueske
28150685815] [label:state_timeout] ontimer at > 1528150685813(CLEANUP_TIME_2), clean/empty the rowMapState > [{key:1528149885813, value:[Row:(orderId:002,userId:U123)]}]* > > > > > > > > > [1] : https://issues.apache.org/jira/browse/FLINK-9524 > > > -

Re: Checkpointing when reading from files?

2018-06-05 Thread Fabian Hueske
Hi, The continuous file source is split into two components. 1) A split generator that monitors a directory and generates splits when a new file is observed, and 2) reading tasks that receive splits and read the referenced files. I think this is the code that generates input splits which are
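A usage sketch of that two-component source via the monitoring read mode (directory path and scan interval are illustrative):

    import org.apache.flink.api.java.io.TextInputFormat;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    String path = "hdfs:///data/incoming";  // monitored directory (illustrative)
    TextInputFormat format = new TextInputFormat(new Path(path));

    // a single, non-parallel split generator monitors the directory every 10s,
    // the parallel reading tasks consume the generated splits
    DataStream<String> lines = env.readFile(
        format, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 10_000L);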

Re: JVM metrics disappearing after job crash, restart

2018-06-04 Thread Fabian Hueske
Hi Nik, Can you have a look at this JIRA ticket [1] and check if it is related to the problems you are facing? If so, would you mind leaving a comment there? Thank you, Fabian [1] https://issues.apache.org/jira/browse/FLINK-8946 2018-05-31 4:41 GMT+02:00 Nikolas Davis : > We keep track of

[ANNOUNCE] Flink Forward Berlin 2018 - Call for Presentations extended until June 11

2018-06-01 Thread Fabian Hueske
Hi everybody, Due to popular demand, we've extended the Call for Presentations for Flink Forward Berlin 2018 by one week. The call will close on *Monday, June 11* (11:59pm CEST). Please submit a proposal to present your Flink and Stream Processing use case, experiences, and best practices in

Re: Multiple Task Slots support in Flink 1.5

2018-06-01 Thread Fabian Hueske
Hi, The release notes state that "multiple slots are not *fully* supported". In Flink 1.5.0, the configured number of slots is ignored when requesting containers for TaskManagers from a resource manager, i.e., Flink assumes TMs with 1 slot. Hence, Flink requests too many containers and starts too

Re: TimerService/Watermarks and Checkpoints

2018-06-01 Thread Fabian Hueske
the out of order events number is very high > though :thinking_face: > > > > > > On Wed, May 30, 2018 at 1:55 PM, Fabian Hueske wrote: > >> Hi Nara and Sihua, >> >> That's indeed an unexpected behavior and it would be good to identify the >> reason for

Re: Use element of the DataStream in parameter of RichMapFunction (open function not called)

2018-05-31 Thread Fabian Hueske
Hi Isabelle, Welcome to the Flink user mailing list! You are mixing up the two ways to specify a function: 1. Defining a function as a class / object and passing an instance in the map() method. Given your CustomMapFunction class, this looks as follows: stream.keyBy(...).map(new
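A short sketch of variant 1, assuming a hypothetical CustomMapFunction with a constructor parameter (types and the threshold are illustrative):

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.datastream.DataStream;

    public static class CustomMapFunction extends RichMapFunction<Long, Long> {

        private final long threshold;  // hypothetical constructor parameter

        public CustomMapFunction(long threshold) {
            this.threshold = threshold;
        }

        @Override
        public void open(Configuration parameters) {
            // called once per parallel instance before the first map() call
        }

        @Override
        public Long map(Long value) {
            return Math.min(value, threshold);
        }
    }

    // pass an instance (not the class) to map(); the parameter travels with the serialized instance
    DataStream<Long> capped = stream.map(new CustomMapFunction(100L));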

Re: NPE in flink sql over-window

2018-05-31 Thread Fabian Hueske
or ProcTimeBoundedRangeOver. I will update > with my test result and fire a JIRA after that. > > > Best > > Yan > ------ > *From:* Fabian Hueske > *Sent:* Wednesday, May 30, 2018 1:43:01 AM > *To:* Dawid Wysakowicz > *Cc:* user > *Subje

Re: NPE in flink sql over-window

2018-05-30 Thread Fabian Hueske
Hi, Dawid's analysis is certainly correct, but looking at the code this should not happen. I have a few questions: - You said this only happens if you configure idle state retention times, right? - Does the error occur the first time without a previous recovery? - Did you run the same query on

Re: env.execute() ?

2018-05-29 Thread Fabian Hueske
Hi, It is mandatory for all DataStream programs and most DataSet programs. Exceptions are DataSet.print() and DataSet.collect(). Both methods are defined on DataSet in the batch API and call execute() internally. Best, Fabian 2018-05-29 12:31 GMT+02:00 Esa
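A minimal sketch of both cases (illustrative only):

    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    // DataStream API: execute() is always required to start the job
    StreamExecutionEnvironment senv = StreamExecutionEnvironment.getExecutionEnvironment();
    senv.fromElements(1, 2, 3).print();
    senv.execute("streaming job");      // without this call, nothing is executed

    // DataSet API: DataSet.print() / collect() call execute() internally
    ExecutionEnvironment benv = ExecutionEnvironment.getExecutionEnvironment();
    benv.fromElements(1, 2, 3).print(); // triggers execution, no explicit execute() needed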

[ANNOUNCE] Final Reminder - Call for Presentations - Flink Forward Berlin 2018

2018-05-28 Thread Fabian Hueske
Hi all, This is the final reminder for the call for presentations for Flink Forward Berlin 2018. *The call closes in 7 days* (June 4, 11:59 PM CEST). Submit your talk and get to present your Apache Flink and stream processing ideas, experiences, use cases, and best practices on September 4-5 in

Re: Writing Table API results to a csv file

2018-05-28 Thread Fabian Hueske
Hi, Jörn is probably right. In contrast to print(), which immediately triggers an execution, writeToSink() just appends a sink operator and requires the execution to be triggered explicitly. The INFO messages of the TypeExtractor are "just" telling you that Row cannot be used as a POJO type, but

Re: Clarification in TumblingProcessing TimeWindow Documentation

2018-05-28 Thread Fabian Hueske
I agree, this should be fixed. Thanks for noticing, Dhruv. Would you mind creating a JIRA for this? Thank you, Fabian 2018-05-28 8:39 GMT+02:00 Bowen Li : > Hi Dhruv, > > I can see it's confusing, and it does seem the comment should be improved. > You can find concrete

Re: Large number of sources in Flink Job

2018-05-28 Thread Fabian Hueske
Hi Chirag, There have been some issues with very large execution graphs. You might need to adjust the default configuration and configure larger Akka buffers and/or timeouts. Also, 2000 sources means that you run at least 2000 threads at once. The FileInputFormat (and most of its sub-classes) in

Re: [ANNOUNCE] Apache Flink 1.5.0 release

2018-05-28 Thread Fabian Hueske
Thank you Till for serving as a release manager for Flink 1.5! 2018-05-25 19:46 GMT+02:00 Till Rohrmann : > Quick update: I had to update the date of the release blog post which also > changed the URL. It can now be found here: > >

Re: How to restore state from savepoint with flink SQL

2018-05-24 Thread Fabian Hueske
epends on its > location in the graph and its input/output. > > > Best > > Yan > ------ > *From:* Fabian Hueske <fhue...@gmail.com> > *Sent:* Wednesday, May 23, 2018 3:18:08 AM > *To:* Yan Zhou [FDS Science] > *Cc:* user@flink.apache.org >

Re: When is 1.5 coming out

2018-05-23 Thread Fabian Hueske
Hi Vishal, Release candidate 5 (RC5) has been published and the voting period ends later today. Unless we find a blocking issue, we can push the release out today. FYI, if you are interested in the release progress, you can subscribe to the dev mailing list (or just check out the archives at

Re: Fwd: Decrease initial source read speed

2018-05-23 Thread Fabian Hueske
Hi Andrei, With the current version of Flink, there is no general solution to this problem. The upcoming version 1.5.0 of Flink adds a feature called credit-based flow control which might help here. I'm adding @Piotr to this thread who knows more about the details of this new feature. Best,

Re: Order of events with chanined keyBy calls of same key

2018-05-23 Thread Fabian Hueske
Hi, I've posted an answer on SO. Best, Fabian 2018-05-22 18:11 GMT+02:00 Shimony, Shay : > Hi everyone, > > > > I have this question in StackOverflow, and would be happy if you could > answer. > > https://stackoverflow.com/questions/50340107/order-of- >

Re: program args size for running jobs

2018-05-23 Thread Fabian Hueske
Hi Esteban, If you need the parameters to configure specific operators (instead of the over all flow), you could pass the parameters as a file using the distributed cache [1]. Note, the docs point to the DataSet (batch) API, but the feature works the same way for DataStream programs as well.
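A sketch of the distributed-cache approach for the DataSet API (per the thread, the feature works the same way for DataStream programs); `input` is an assumed DataSet<String>, and the file path, name, and property key are illustrative:

    import java.io.File;
    import java.io.FileInputStream;
    import java.util.Properties;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.configuration.Configuration;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // register the parameter file once; it is shipped to all TaskManagers
    env.registerCachedFile("hdfs:///params/operator-config.properties", "operatorParams");

    DataSet<String> result = input.map(new RichMapFunction<String, String>() {

        private transient Properties params;

        @Override
        public void open(Configuration conf) throws Exception {
            // read the locally cached copy of the registered file
            File file = getRuntimeContext().getDistributedCache().getFile("operatorParams");
            params = new Properties();
            try (FileInputStream in = new FileInputStream(file)) {
                params.load(in);
            }
        }

        @Override
        public String map(String value) {
            return params.getProperty("prefix", "") + value;
        }
    });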

Re: How to restore state from savepoint with flink SQL

2018-05-23 Thread Fabian Hueske
Hi, At the moment, you can only restore a query from a savepoint if the query is not modified and the same Flink version is used. Since SQL queries are automatically translated into data flows, it is not transparent to the user which operators will be created. We would need to expose an

Re: Limitations with Retract Streams on SQL

2018-05-23 Thread Fabian Hueske
Hi Gregory, Rong's analysis is correct. The UNION with duplicate elimination is translated into a UNION ALL and a subsequent grouping operator on all attributes without an aggregation function. Flink assumes that all grouping operators can produce retractions (updates) and window-grouped

Flink Forward Berlin 2018 - Call for Presentations open until 4th June

2018-05-18 Thread Fabian Hueske
Hi all, Flink Forward returns to Berlin on September 3-5, 2018. We are happy to announce the Call for Presentations is now open! Please submit a proposal if you'd like to present your Apache Flink experience, best practices or use case in front of an international audience of highly skilled and

Re: chained operator with different parallelism question

2018-05-18 Thread Fabian Hueske
Functions with different parallelism cannot be chained. Chaining means that Functions are fused into a single operator and that records are passed by method calls (instead of serializing them into an in-memory or network channel). Hence, chaining does not work if you have different parallelism.

Re: Best way to clean-up states in memory

2018-05-16 Thread Fabian Hueske
<Integer, > Long>> { > private static final long serialVersionUID = 1L; > > @Override > public Tuple2<Integer, Long> reduce(Tuple2<Integer, Long> tup1, > Tuple2<Integer, Long> tup2) throws Exception { > Tuple2<Integer, Long> retTup = tup1;

Re: Consumed input splits

2018-05-16 Thread Fabian Hueske
I think this would be a very good feature. There's a pretty old JIRA for it [1]. It's even from pre-Apache times because it was imported from the original Github repository. Cheers, Fabian [1] https://issues.apache.org/jira/browse/FLINK-766 2018-05-14 16:46 GMT+02:00 Flavio Pompermaier

Re: Best way to clean-up states in memory

2018-05-14 Thread Fabian Hueske
Hi Ashish, Did you use per-window state (also called partitioned state) in your Trigger? If yes, you need to make sure that it is completely removed in the clear() method because processing time timers won't fire once a window was purged. So you cannot (fully) rely on timers to clean up

Re: Late data before window end is even close

2018-05-14 Thread Fabian Hueske
ave >> any idea on this? Is there some existing test that simulates out of order >> input to flink's kafka consumer? I could try to build a test case based on >> that to possibly reproduce my problem. I'm not sure how to gather enough >> debug information on the production str

Re: Batch writing from Flink streaming job

2018-05-14 Thread Fabian Hueske
Hi, Avro provides schema for data and can be used to serialize individual records in a binary format. It does not compress the data (although this can be put on top) but is more space efficient due to the binary serialization. I think you can implement a Writer for the BucketingSink that writes

Re: Late data before window end is even close

2018-05-11 Thread Fabian Hueske
Hi Juho, Thanks for bringing up this topic! I share your intuition. IMO, records should only be filtered out and sent to a side output if any of the windows they would be assigned to is closed already. I had a look into the code and found that records are filtered out as late based on the

Re: Application logs missing from jobmanager log

2018-05-11 Thread Fabian Hueske
Great, thank you! 2018-05-11 10:31 GMT+02:00 Juho Autio <juho.au...@rovio.com>: > Thanks Fabian, here's the ticket: > https://issues.apache.org/jira/browse/FLINK-9335 > > On Wed, May 2, 2018 at 12:53 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi Juh

Re: Latency with cross operation on Datasets

2018-05-11 Thread Fabian Hueske
Hi Varun, The focus of the DataSet execution is on robustness. The smaller DataSet is stored serialized in memory. Also, most of the communication happens via serialization (instead of passing object references). The serialization should add significant overhead compared to a

Re: Storm topology running in flink.

2018-05-11 Thread Fabian Hueske
Hi, From the dev perspective, not much has been done on that component for a long time [1]. Are there any users of this feature on the user list who can comment on how it works for them? Best, Fabian [1] https://github.com/apache/flink/commits/master/flink-contrib/flink-storm 2018-05-11

Re: Wiring batch and stream together

2018-05-11 Thread Fabian Hueske
Hi Peter, Building the state for a DataStream job in a DataSet (batch) job is currently not possible. You can, however, implement a DataStream job that reads batch data and builds the state. When all data has been processed, you'd need to save the state as a savepoint and can then resume a streaming job

Re: Reading csv-files in parallel

2018-05-09 Thread Fabian Hueske
L together in Scala code ? > > > > Best, Esa > > > > *From:* Fabian Hueske <fhue...@gmail.com> > *Sent:* Tuesday, May 8, 2018 10:26 PM > > *To:* Esa Heikkinen <esa.heikki...@student.tut.fi> > *Cc:* user@flink.apache.org > *Subject:* Re: Reading csv-files in pa

Re: Reading csv-files in parallel

2018-05-08 Thread Fabian Hueske
tructure of main program ? > > I did mean, if I want to read many csv-files and I have certain > consecutive reading order of them. Is that possible and how ? > > > > Actually I want to implement upper level (state-machine-based) logic for > reading csv-files by certain order.

Re: Lost JobManager

2018-05-08 Thread Fabian Hueske
are running a custom build of Flink. Which version did you base your build on? Best, Fabian 2018-05-08 17:41 GMT+02:00 Chan, Regina <regina.c...@gs.com>: > There’s no collect() explicitly from me. It has a cogroup operator before > writing to DataSink. > > > > > > *Fro

Re: Processing Sorted Input Datasets

2018-05-08 Thread Fabian Hueske
Hi Helmut, In fact this is possible with the DataSet API. However, AFAIK it is an undocumented feature and probably not widely used. You can do this by specifying so-called SplitDataProperties on a DataSource as follows: DataSource src = env.createInput(...); SplitDataProperties splitProps =
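A hedged sketch of how the truncated snippet typically continues, assuming the input splits are partitioned and sorted on the first tuple field (env, format, and the field references are illustrative; declaring properties the data does not actually have leads to incorrect results):

    import org.apache.flink.api.common.operators.Order;
    import org.apache.flink.api.java.io.SplitDataProperties;
    import org.apache.flink.api.java.operators.DataSource;
    import org.apache.flink.api.java.tuple.Tuple2;

    DataSource<Tuple2<Long, String>> src = env.createInput(format);  // env and format assumed

    SplitDataProperties<Tuple2<Long, String>> splitProps = src.getSplitDataProperties();
    splitProps.splitsPartitionedBy("f0");                            // each split holds one partition of f0
    splitProps.splitsOrderedBy("f0", new Order[]{Order.ASCENDING});  // records within a split are sorted on f0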

Re: Reading csv-files in parallel

2018-05-08 Thread Fabian Hueske
at read csv-files > parallel ? > > > > Best, Esa > > > > *From:* Fabian Hueske <fhue...@gmail.com> > *Sent:* Monday, May 7, 2018 3:48 PM > *To:* Esa Heikkinen <esa.heikki...@student.tut.fi> > *Cc:* user@flink.apache.org > *Subject:* Re: Reading csv-

Re: Signal for End of Stream

2018-05-07 Thread Fabian Hueske
Hi, Flink will automatically stop the execution of a DataStream program once all sources have finished to provide data, i.e., when all SourceFunction return from the run() method. The DeserializationSchema.isEndOfStream() method can be used to tell a built-in SourceFunction such as a
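A small sketch of an end-of-stream marker via DeserializationSchema, assuming a hypothetical string sentinel:

    import java.nio.charset.StandardCharsets;
    import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;

    // stops a built-in source (e.g. the Kafka consumer) once a sentinel record is seen
    public class StringWithEndMarkerSchema extends AbstractDeserializationSchema<String> {

        @Override
        public String deserialize(byte[] message) {
            return new String(message, StandardCharsets.UTF_8);
        }

        @Override
        public boolean isEndOfStream(String nextElement) {
            // hypothetical sentinel value that signals the end of the stream
            return "END_OF_STREAM".equals(nextElement);
        }
    }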

Re: Reading csv-files in parallel

2018-05-07 Thread Fabian Hueske
Hi Esa, you can certainly read CSV files in parallel. This works very well in a batch query. For streaming queries that expect data to be ingested in timestamp order, this is much more challenging, because you 1) need to read the files in the right order and 2) cannot split files (unless you

Re: Standalone HA Cluster using Shared Zookeeper

2018-05-07 Thread Fabian Hueske
Hi Andre, Sharing a Zookeeper cluster between Kafka and Flink should be OK. If you're running just one cluster, you could in principle keep the default. However, I'd change the configuration just in case. Otherwise, you might get into trouble when you (accidentally) run another Flink setup.
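A flink-conf.yaml sketch for the shared-ZooKeeper case (hosts and paths are illustrative):

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
    high-availability.storageDir: hdfs:///flink/ha/
    # change the defaults so another Flink setup cannot collide with this one
    high-availability.zookeeper.path.root: /flink
    high-availability.cluster-id: /prod_cluster_1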

Re: strange behavior with jobmanager.rpc.address on standalone HA cluster

2018-05-07 Thread Fabian Hueske
Hi Derek, 1. I've created a JIRA issue to improve the docs as you recommended [1]. 2. This discussion goes quite a bit into the internals of the HA setup. Let me pull in Till (in CC) who knows the details of HA. Best, Fabian [1] https://issues.apache.org/jira/browse/FLINK-9309 2018-05-05

Re: YARN per-job cluster reserves all of remaining memory in YARN

2018-05-07 Thread Fabian Hueske
Hi Dongwon, I see that you are using the latest master (Flink 1.6-SNAPSHOT). This is a known problem in the new FLIP-6 mode. The ResourceManager tries to allocate too many resources, basically one TM per required slot, i.e., it does not take the number of slots per TM into account. The resources

Re: Why FoldFunction deprecated?

2018-05-07 Thread Fabian Hueske
Hi, FoldFunction was deprecated because it doesn't support partial aggregation. AggregateFunction is much more expressive but requires a bit more implementation effort. Since FoldFunction doesn't offer anything beyond AggregateFunction, it was deprecated in favor of a concise API.
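A sketch of an AggregateFunction that computes an average with a (sum, count) accumulator, i.e. the kind of partial, mergeable aggregation FoldFunction cannot express (names are illustrative):

    import org.apache.flink.api.common.functions.AggregateFunction;
    import org.apache.flink.api.java.tuple.Tuple2;

    // IN = Long value, ACC = (sum, count), OUT = Double average
    public class AverageAggregate implements AggregateFunction<Long, Tuple2<Long, Long>, Double> {

        @Override
        public Tuple2<Long, Long> createAccumulator() {
            return Tuple2.of(0L, 0L);
        }

        @Override
        public Tuple2<Long, Long> add(Long value, Tuple2<Long, Long> acc) {
            return Tuple2.of(acc.f0 + value, acc.f1 + 1);
        }

        @Override
        public Double getResult(Tuple2<Long, Long> acc) {
            return acc.f1 == 0 ? 0.0 : ((double) acc.f0) / acc.f1;
        }

        @Override
        public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
            // combines partial aggregates, e.g. when session windows are merged
            return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
        }
    }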

Re: Init RocksDB state backend during startup

2018-05-07 Thread Fabian Hueske
Hi Peter, State initialization with historic data is a use case that's coming up more and more. Unfortunately, there's no good solution for this yet but just a couple of workarounds that require careful design and work for all cases. There was a talk about exactly this problem and some ideas

Re: FLIP-6 Docker / Kubernetes

2018-05-07 Thread Fabian Hueske
Hi, Most, but not all, of the FLIP-6 features will be released with Flink 1.5.0. I'm not sure if this deployment mode will be fully supported in 1.5.0. Gary (in CC) might know details here. Anyway, the deployment would work by starting the image using regular Docker/Kubernetes tools. The image

Re: Lost JobManager

2018-05-07 Thread Fabian Hueske
Hi Regina, I see from the logs that you are using the DataSet API. Are you trying to fetch a large result to your client using the collect() method? Best, Fabian 2018-05-02 0:38 GMT+02:00 Chan, Regina : > Hi, > > > > I’m running a single TM with the following params -yn 1

Re: Stashing key with AggregateFunction

2018-05-06 Thread Fabian Hueske
mples in the documentation referenced below > have a number of bugs, see FLINK-9299 > <https://issues.apache.org/jira/browse/FLINK-9299>. > > > On May 4, 2018, at 7:35 AM, Fabian Hueske <fhue...@gmail.com> wrote: > > Hi Ken, > > You can also

Re: Use of AggregateFunction's merge() method

2018-05-06 Thread Fabian Hueske
Hi Ken, You are right. The merge() method combines partial aggregates, similar to a combinable reducer. The only situation when merge() is called in a DataStream job (that I am aware of) is when session windows get merged. For example when you define a session window with 30 minute gap and you

Re: Question about datasource replication

2018-05-04 Thread Fabian Hueske
isk after the > flatMaps? > > On Fri, May 4, 2018 at 6:02 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> That will happen if you join (or coGroup) the branched DataSets, i.e., >> you have branching and merging pattern in your stream. >> >> The problem in th

Re: Question about datasource replication

2018-05-04 Thread Fabian Hueske
usage during the job and it was corresponding exactly to the size estimated > by the Flink UI, that is twice it's initial size). > Probably there are no problem until you keep data in memory but in my case > it's very problematic this memory explosion :( > > On Fri, May 4, 2018 at 5:14 PM, Fabian Huesk

Re: Question about datasource replication

2018-05-04 Thread Fabian Hueske
Hi Flavio, No, there's no way around it. DataSets that are processed by more than one operator cannot be processed by chained operators. The records need to be copied to avoid concurrent modifications. However, the data should not be shipped over the network if all operators have the same

Re: Stashing key with AggregateFunction

2018-05-04 Thread Fabian Hueske
Hi Ken, You can also use an additional ProcessWindowFunction [1] that is applied on the result of the AggregateFunction to set the key. Since the PWF is only applied on the final result, there is no overhead (actually, it might even be slightly cheaper because the AggregateFunction can be simpler).
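A sketch of that combination, assuming the input is a DataStream<Tuple2<String, Long>> and AverageOfValue is a hypothetical AggregateFunction<Tuple2<String, Long>, ?, Double> similar to the AverageAggregate sketched further up:

    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
    import org.apache.flink.util.Collector;

    DataStream<Tuple2<String, Double>> result = input
        .keyBy(new KeySelector<Tuple2<String, Long>, String>() {
            @Override
            public String getKey(Tuple2<String, Long> t) {
                return t.f0;
            }
        })
        .timeWindow(Time.minutes(5))
        .aggregate(
            new AverageOfValue(),  // incrementally aggregates, keeps only the accumulator as window state
            new ProcessWindowFunction<Double, Tuple2<String, Double>, String, TimeWindow>() {
                @Override
                public void process(String key, Context ctx, Iterable<Double> aggregates,
                                    Collector<Tuple2<String, Double>> out) {
                    // the iterable holds exactly one element: the pre-aggregated window result
                    out.collect(Tuple2.of(key, aggregates.iterator().next()));
                }
            });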

Re: Stream.union doesn't change the parallelism of the new stream?

2018-05-03 Thread Fabian Hueske
Hi, Union is not an actual operator in Flink. Instead, the operator that is applied on the unioned stream ingests its input from all unioned streams. The parallelism of that operator is the configured default parallelism (can be specified at the execution environment) unless it is explicitly

Re: Window over events defined by a time range

2018-05-03 Thread Fabian Hueske
Hi, Flink can add events to multiple windows. For instance, the built-in sliding windows are doing this. You can address your use case by implementing a custom WindowAssigner [1]. Best, Fabian [1]

Re: KafkaProducer with generic (Avro) serialization schema

2018-05-02 Thread Fabian Hueske
Hi Wouter, you can try to make the SerializationSchema serializable by overriding Java's serialization methods writeObject() and readObject(), similar to what Flink's AvroRowSerializationSchema [1] does. Best, Fabian [1]
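A rough sketch of that pattern for a generic Avro SerializationSchema, assuming the writer schema is passed as a JSON string (everything except the Avro and Flink classes is illustrative):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.flink.api.common.serialization.SerializationSchema;

    public class GenericAvroSerializationSchema implements SerializationSchema<GenericRecord> {

        private final String schemaString;                          // Java-serializable
        private transient Schema schema;                            // Avro objects are not serializable
        private transient GenericDatumWriter<GenericRecord> writer;

        public GenericAvroSerializationSchema(String schemaString) {
            this.schemaString = schemaString;
        }

        @Override
        public byte[] serialize(GenericRecord record) {
            if (writer == null) {
                initAvro();  // covers local execution without Java serialization
            }
            try {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
                writer.write(record, encoder);
                encoder.flush();
                return out.toByteArray();
            } catch (IOException e) {
                throw new RuntimeException("Failed to serialize Avro record", e);
            }
        }

        private void initAvro() {
            this.schema = new Schema.Parser().parse(schemaString);
            this.writer = new GenericDatumWriter<>(schema);
        }

        // re-create the transient Avro objects after Java deserialization on the TaskManager
        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            initAvro();
        }
    }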

Re: Setting the parallelism in a cluster of machines properly

2018-05-02 Thread Fabian Hueske
It's not a requirement but the exception reads "org.apache.flink.runtime.client.JobClientActorConnectionTimeoutException: Lost connection to the JobManager.". So increasing the timeout might help. Best, Fabian 2018-05-02 12:20 GMT+02:00 m@xi : > Hello Fabian! > > Thanks

Re: Windows aligned with EDT ( with day light saving )

2018-05-02 Thread Fabian Hueske
Hi Vishal, AFAIK it is not possible with Flink's default time windows. However, it should be possible to implement a custom WindowAssigner for your use case. I'd have a look at the TumblingEventTimeWindows class and copy/modify it to your needs. Best, Fabian 2018-05-02 15:12 GMT+02:00 Vishal

Re: Setting the parallelism in a cluster of machines properly

2018-05-02 Thread Fabian Hueske
Hi, did you try to increase the Akka timeout [1]? Best, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/config.html#distributed-coordination-via-akka 2018-04-29 19:44 GMT+02:00 m@xi : > Guys seriously I have done the process as described in the
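A flink-conf.yaml sketch for raising the timeout (the values are illustrative; the keys are described on the page linked above):

    # default is 10 s; raise it if the client loses the connection to the JobManager
    akka.ask.timeout: 60 s
    # client-side timeout for cluster communication
    akka.client.timeout: 120 s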

Re: Batch job stuck in Canceled state in Flink 1.5

2018-05-02 Thread Fabian Hueske
Hi Amit, We recently fixed a bug in the network stack that affected batch jobs (FLINK-9144). The fix was added after your commit. Do you have a chance to build the current release-1.5 branch and check if the fix also resolves your problem? Otherwise it would be great if you could open a blocker

Re: Application logs missing from jobmanager log

2018-05-02 Thread Fabian Hueske
Hi Juho, I assume that these logs are generated from a different process, i.e., the client process and not the JM or TM process. Hence, they end up in a different log file and are not covered by the log collection of the UI. The reason is that this process might also be run on a machine outside

Re: coordinate watermarks between jobs?

2018-05-02 Thread Fabian Hueske
Hi Tao, The watermarks of operators that consume from two (or more) streams are always synced to the lowest watermark. This behavior guarantees that data won't be late (unless it was late when watermarks were assigned). However, the operator will most likely need to buffer more events from the

Re: Multiple Streams Connect Watermark

2018-04-26 Thread Fabian Hueske
> > Best, > Chengzhi > > On Thu, Apr 26, 2018 at 6:05 AM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi Chengzhi, >> >> Functions in Flink are implemented in a way to preserve the timestamps of >> elements or assign timestamps which are aligned

Re: data enrichment with SQL use case

2018-04-26 Thread Fabian Hueske
issue. > > Regards, > > — Ken > > > On Apr 25, 2018, at 12:39 PM, Michael Latta <lat...@me.com> wrote: > > Using a flat map function, you can always buffer the non-meta data stream > in the operator state until the metadata is aggregated, and then process > an

Re: Help with OneInputStreamOperatorTestHarness

2018-04-26 Thread Fabian Hueske
Thanks for reporting the issue Chris! Would you mind opening a JIRA issue [1] to track the bug for Flink 1.4? Thank you, Fabian [1] https://issues.apache.org/jira/browse/FLINK 2018-04-25 21:11 GMT+02:00 Chris Schneider : > Hi Gang, > > FWIW, the code below works

Re: Multiple Streams Connect Watermark

2018-04-26 Thread Fabian Hueske
Hi Chengzhi, Functions in Flink are implemented in a way to preserve the timestamps of elements or assign timestamps which are aligned with the existing watermarks. For example, the result of a time window aggregation has the end timestamp of the window as a timestamp and records emitted by the

Re: Why assignTimestampsAndWatermarks parallelism same as map,it will not fired?

2018-04-25 Thread Fabian Hueske
Hi, This sounds like one of the partitions does not receive data. Watermark generation is data driven, i.e., the watermark can only advance if the TimestampAndWatermarkAssigner sees events. By changing the parallelism between the map and the assigner, the events are shuffled across and hence

Re: Use gradle with flink

2018-04-24 Thread Fabian Hueske
You can certainly set up and build Flink applications with Gradle. However, the bad news is that the Flink project does not provide a pre-configured Gradle project/configuration yet. The good news is that the community is working on that [1] and there's already a PR [2] (opened 19 hours ago). Btw. besides

Re: KafkaJsonTableSource purpose

2018-04-24 Thread Fabian Hueske
Hi Sebastien, I think you can do that with Flink's Table API / SQL and the KafkaJsonTableSource. Note that in Flink 1.4.2, the KafkaJsonTableSource does not support nested JSON yet. You'd also need a table-valued UDF for the parsing of the message and joining the result with the original row.

Re: data enrichment with SQL use case

2018-04-24 Thread Fabian Hueske
unds, that must be helpful for my case as well > > Thank you, > Alex > > On Mon, Apr 23, 2018 at 2:14 PM Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi Miki, >> >> Sorry for the late response. >> There are basically two ways to implement an enri

Re: Checkpointing barriers

2018-04-24 Thread Fabian Hueske
Hi Alex, That's correct. The n refers to the n-th checkpoint. The checkpoint ID is important because operators need to align the barriers to ensure that they consumed all inputs up to the point where the barriers were injected into the stream. Each operator checkpoints its own state. For

Re: Flink and spatial data

2018-04-23 Thread Fabian Hueske
Hi Esa, there's no built-in support for handling spatial data in Flink. However, you can use any JVM-based spatial library in your application to perform such computations. One option is the ESRI library [1]. Also, there is a JIRA issue [2] to add support for a few spatial functions (as provided by

Re: Building a Flink Connector

2018-04-23 Thread Fabian Hueske
Hi, I just realized that the documentation completely lacks a section about implementing custom source connectors :-( The JavaDocs of the relevant interface classes [1] [2] [3] [4] are quite extensive though. I'd also recommend to have a look at the implementations of other source connectors

Re: Unsure how to further debug - operator threads stuck on java.lang.Thread.State: WAITING

2018-04-23 Thread Fabian Hueske
Hi Miguel, Actually, a lot has changed since 1.4. Flink 1.5 will feature a completely new (cluster) setup and deployment model. The dev effort is known as FLIP-6 [1]. So it is not unlikely that you discovered a regression. Would you mind opening a JIRA ticket for the issue? Thank you very much,

Re: data enrichment with SQL use case

2018-04-23 Thread Fabian Hueske
Hi Miki, Sorry for the late response. There are basically two ways to implement an enrichment join as in your use case. 1) Keep the meta data in the database and implement a job that reads the stream from Kafka and queries the database in an AsyncIO operator for every stream record. This should
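A sketch of option 1 with the async I/O operator (Flink 1.5 API; Event, EnrichedEvent, MetaDataClient, and `events` are placeholders, and a real implementation should use a truly non-blocking client or a dedicated thread pool):

    import java.util.Collections;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.datastream.AsyncDataStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.async.ResultFuture;
    import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

    public class MetaDataLookup extends RichAsyncFunction<Event, EnrichedEvent> {

        private transient MetaDataClient client;  // hypothetical database client

        @Override
        public void open(Configuration conf) {
            client = new MetaDataClient();        // one connection per parallel instance
        }

        @Override
        public void asyncInvoke(Event event, ResultFuture<EnrichedEvent> resultFuture) {
            CompletableFuture
                .supplyAsync(() -> client.lookup(event.getKey()))   // query the meta data
                .thenAccept(meta ->
                    resultFuture.complete(Collections.singleton(new EnrichedEvent(event, meta))));
        }
    }

    // at most 100 concurrent lookups, 1 second timeout per request
    DataStream<EnrichedEvent> enriched = AsyncDataStream.unorderedWait(
        events, new MetaDataLookup(), 1000, TimeUnit.MILLISECONDS, 100);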

Re: Join two data streams on a given key and diffrent common window size.

2018-04-23 Thread Fabian Hueske
Hi, The semantics of the joins offered by the DataStream API in Flink 1.4 and before as well as the upcoming 1.5 version are a bit messed up, IMO. Since Flink 1.4, Flink SQL implements a better windowed join [1]. DataStream and SQL can be easily integrated with each other. A similar
