Re: How to handle Flink Job with 400MB+ Uberjar with 800+ containers ?

2019-08-30 Thread SHI Xiaogang
Hi Dadashov, You may have a look at the method YarnResourceManager#onContainersAllocated, which launches containers (via NMClient#startContainer) after they are allocated. The launching is performed in the main thread of YarnResourceManager and is synchronous/blocking.

Re: [SURVEY] Is the default restart delay of 0s causing problems?

2019-08-30 Thread Zhu Zhu
In our production, we usually override the restart delay to 10 s. We once encountered cases where external services were overwhelmed by reconnections from frequently restarted tasks. As a safer, though not optimal, option, a default delay larger than 0 s is better in my opinion.
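The 10 s override described above maps onto Flink's fixed-delay restart strategy settings in flink-conf.yaml; a minimal sketch (the attempt count is illustrative, not from the thread):

```yaml
# flink-conf.yaml — override the default 0 s restart delay
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

The same strategy can also be set per job via the execution environment instead of cluster-wide.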

[ANNOUNCE] Kinesis connector becomes part of Flink releases

2019-08-30 Thread Bowen Li
Hi all, I'm glad to announce that, as #9494 was merged today, flink-connector-kinesis is now officially under the Apache 2.0 license in the master branch, and its artifact will be deployed to Maven central as part of Flink releases starting from Flink 1.10.0. Users

[ANNOUNCEMENT] September 2019 Bay Area Apache Flink Meetup

2019-08-30 Thread Xuefu Zhang
Hi all, As promised, we plan to have a quarterly Flink meetup, and now it's about time. Thus, I'm happy to announce that the next Bay Area Apache Flink Meetup [1] is scheduled for Sept. 24 at Yelp, 140 New Montgomery in San Francisco. Schedule: 6:30 - Doors open 6:30 - 7:00 PM Networking and

Re: How to handle Flink Job with 400MB+ Uberjar with 800+ containers ?

2019-08-30 Thread Elkhan Dadashov
Thanks everyone for the valuable input and for sharing your experience tackling the issue. Regarding the suggestions: - We provision some common jars in all cluster nodes *-->* but this requires dependence on the Infra Team's schedule for handling/updating common jars - Making the Uberjar slimmer *-->* tried

Re: problem with avro serialization

2019-08-30 Thread Debasish Ghosh
Hello Aljoscha - I made a comment on your PR ( https://github.com/apache/flink/pull/9565/files#r319598469). With the suggested fix it runs fine .. Thanks. regards. On Fri, Aug 30, 2019 at 4:48 PM Debasish Ghosh wrote: > Thanks a lot .. sure I can do a build with this PR and check. > >

Re: Build Flink against a vendor specific Hadoop version

2019-08-30 Thread Elise RAMÉ
Thank you all! The classpath option works for me and is easier, so I'll do it this way. About flink-shaded and vendor-repos, would it be helpful if I described this issue in a Jira ticket? > On 30 Aug 2019, at 11:43, Chesnay Schepler wrote: > > It appears we did not port the vendor-repos profile

[SURVEY] Is the default restart delay of 0s causing problems?

2019-08-30 Thread Till Rohrmann
Hi everyone, I wanted to reach out to you and ask whether decreasing the default delay to `0 s` for the fixed delay restart strategy [1] is causing trouble. A user reported that he would like to increase the default value because it can cause restart storms in case of systematic faults [2]. The

Re: Assigning UID to Flink SQL queries

2019-08-30 Thread Yuval Itzchakov
Anyone? On Tue, 27 Aug 2019, 17:23 Yuval Itzchakov, wrote: > Hi, > > We have a bunch of Flink SQL queries running in our Flink environment. > For > regular Table API interactions, we can override `uid`, which also gives us > an > indicative name for the thread/UI to look at. For Flink SQL

Re: Non incremental window function accumulates unbounded state with RocksDb

2019-08-30 Thread William Jonsson
Thanks for your answer Yun. I agree, I don't believe that either; however, that's my empirical observation. Those statistics are from savepoints. Basically the jobs are running against a production Kafka, so no, not exactly the same input. However, these statistics are from several runs

Re: Non incremental window function accumulates unbounded state with RocksDb

2019-08-30 Thread Yun Tang
Hi William I don't believe the same job would have 70~80GB state for RocksDB while it's only 200MB for HeapStateBackend even though RocksDB has some space amplification. Are you sure the job received the same input throughput with different state backends and they both run well without any

Re: best practices on getting flink job logs from Hadoop history server?

2019-08-30 Thread Yun Tang
Hi Yu, If you have the client job log, you can find your application id from the description below: The Flink YARN client has been started in detached mode. In order to stop Flink on YARN, use the following command or a YARN web interface to stop it: yarn application -kill {appId} Please also

Non incremental window function accumulates unbounded state with RocksDb

2019-08-30 Thread William Jonsson
Hello, I have a Flink pipeline reading data from Kafka which is keyed (in the pseudocode example below it is keyed on the first letter in the string; in our pipeline it is keyed on a predefined key) and processed in sliding windows with a duration of 60m every 10th minute. The time setting is
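The window parameters above already explain part of the state growth: with a 60-minute window sliding every 10 minutes, each element is buffered in size/slide = 6 overlapping panes when the window function is non-incremental. A small sketch of that arithmetic, assuming epoch-aligned windows as in Flink's SlidingEventTimeWindows (the helper name `windows_for` is hypothetical, for illustration only):

```python
WINDOW_SIZE_MIN = 60  # window duration, minutes
SLIDE_MIN = 10        # slide interval, minutes

def windows_for(event_minute):
    """Return the [start, end) minute range of every window pane containing the event."""
    # Start of the latest window that begins at or before the event.
    last_start = event_minute - (event_minute % SLIDE_MIN)
    starts = range(last_start - WINDOW_SIZE_MIN + SLIDE_MIN, last_start + 1, SLIDE_MIN)
    return [(s, s + WINDOW_SIZE_MIN) for s in starts]

# An event at minute 125 lands in 6 overlapping panes, so a non-incremental
# window function keeps ~size/slide copies of every element in state.
print(windows_for(125))  # 6 panes: (70,130), (80,140), ..., (120,180)
```

This is why replacing the full-window function with an incremental aggregate (reduce/aggregate) typically shrinks window state dramatically.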

Re: problem with avro serialization

2019-08-30 Thread Aljoscha Krettek
Hi, I cut a PR that should fix this issue for Avrohugger: https://github.com/apache/flink/pull/9565 Would you be able to build this and see if it solves your problem? Best, Aljoscha > On 30. Aug 2019, at 09:02, Debasish Ghosh wrote: > > From

Re: Error while using catalog in .yaml file

2019-08-30 Thread Yebgenya Lazarkhosrouabadi
Hello, I built Flink from source and now have the flink-connector-hive jar file. I copied it to Flink's lib directory, but I still get the same error when I try to run ./sql-client.sh embedded. I get this error: Exception in thread "main"

Re: best practices on getting flink job logs from Hadoop history server?

2019-08-30 Thread Zhu Zhu
Hi Yu, Regarding #2: currently we search for the task deployment log in the JM log, which contains info on the container and machine the task is deployed to. Regarding #3: you can find the application logs aggregated per machine on DFS; the path depends on your YARN config. Each log may still include

Re: How to handle Flink Job with 400MB+ Uberjar with 800+ containers ?

2019-08-30 Thread Zhu Zhu
One optimization we apply is letting YARN reuse the flink-dist jar that was localized when running previous jobs. Thanks, Zhu Zhu Jörn Franke wrote on Fri, Aug 30, 2019 at 4:02 PM: > Increase replication factor and/or use HDFS cache >

Re: How to handle Flink Job with 400MB+ Uberjar with 800+ containers ?

2019-08-30 Thread Jörn Franke
Increase the replication factor and/or use the HDFS cache https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html Try to reduce the size of the jar, e.g. the Flink libraries do not need to be included. > On 30 Aug 2019, at 01:09, Elkhan Dadashov wrote: > >
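Both suggestions can be applied with stock Hadoop CLI commands; a sketch assuming the uberjar has already been uploaded to HDFS (path, replication factor, and pool name are illustrative):

```shell
# Raise the replication factor so 800 NodeManagers don't all pull the
# 400MB jar from the default 3 datanodes (-w waits for completion):
hdfs dfs -setrep -w 10 /user/flink/app/uberjar.jar

# Or pin the jar in the HDFS centralized cache (see the linked docs):
hdfs cacheadmin -addPool flink-jars
hdfs cacheadmin -addDirective -path /user/flink/app/uberjar.jar -pool flink-jars
```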

best practices on getting flink job logs from Hadoop history server?

2019-08-30 Thread Yu Yang
Hi, We run flink jobs through yarn on hadoop clusters. One challenge that we are facing is simplifying flink job log access. The flink job logs can be accessed using "yarn logs $application_id". That approach has a few limitations: 1. It is not straightforward to find the yarn application id
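For reference, the retrieval flow described above looks roughly like this with the standard YARN CLI (the application id and grep pattern are illustrative):

```shell
# Find the application id of a finished Flink job:
yarn application -list -appStates FINISHED | grep -i flink

# Fetch the aggregated container logs for that application:
yarn logs -applicationId application_1567000000000_0001 > flink-job.log
```

Log aggregation must be enabled (yarn.log-aggregation-enable) for the second command to return anything after the application finishes.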

Re: Using Avro SpecficRecord serialization instead of slower ReflectDatumWriter/GenericDatumWriter

2019-08-30 Thread Till Rohrmann
Hi Roshan, these kinds of questions should be posted to Flink's user mailing list. I've cross-posted it now. If you are using Flink's latest version and your type extends `SpecificRecord`, then Flink's AvroSerializer should use the `SpecificDatumWriter`. If this is not the case, then this sounds

Re: How to handle Flink Job with 400MB+ Uberjar with 800+ containers ?

2019-08-30 Thread Jeff Zhang
I can think of 2 approaches: 1. Allow flink to specify the replication factor of the submitted uber jar. 2. Allow flink to specify a config flink.yarn.lib pointing to all the flink-related jars hosted on hdfs. This way users don't need to build and submit a fat uber jar every time. And those flink

Re: Flink 1.9, MapR secure cluster, high availability

2019-08-30 Thread Maxim Parkachov
Hi Stephan, With previous versions (I tried around 1.7), I always had to compile the MapR Hadoop to get it working. With 1.9 I took the Hadoop-less Flink, which worked with MapR FS until I switched on HA. So it is hard to say whether this is a regression or not. The error happens when Flink tries to initialize

Re: How to handle Flink Job with 400MB+ Uberjar with 800+ containers ?

2019-08-30 Thread Till Rohrmann
For point 2. there exists already a JIRA issue [1] and a PR. I hope that we can merge it during this release cycle. [1] https://issues.apache.org/jira/browse/FLINK-13184 Cheers, Till On Fri, Aug 30, 2019 at 4:06 AM SHI Xiaogang wrote: > Hi Datashov, > > We faced similar problems in our

Re: checkpoint failure suddenly even state size less than 1 mb

2019-08-30 Thread Yun Tang
Hi Sushant, What confuses me is why the source task cannot complete the checkpoint within 3 minutes [1]. If no sub-task has ever completed the checkpoint, that means even the source task cannot complete it. Actually, the source task does not need to buffer the data. From what I see, it might be affected by

Re: problem with avro serialization

2019-08-30 Thread Debasish Ghosh
From https://stackoverflow.com/a/56104518 .. AFAIK the only solution is to update Flink to use avro's > non-reflection-based constructors in AvroInputFormat >