[ANNOUNCE] Dropping "CheckpointConfig.setPreferCheckpointForRecovery()"

2021-08-24 Thread Stephan Ewen
Hi Flink Community! A quick heads-up: We suggest removing the setting "CheckpointConfig.setPreferCheckpointForRecovery()" [1]. The setting has been deprecated since Flink 1.12 and is strongly discouraged, because it can lead to data loss or data duplication in different scenarios. Please see

Re: Bloom Filter - RocksDB - LinkageError Classloading

2021-08-05 Thread Stephan Ewen
omers would include Flink classes within the > application jar package, and it might cause problems if the client has > different flink version with servers. > > > Best, > Yun Tang > ------ > *From:* Stephan Ewen > *Sent:* Wednesday, August 4, 2

Re: [ANNOUNCE] RocksDB Version Upgrade and Performance

2021-08-04 Thread Stephan Ewen
graded > from Flink 1.9 to 1.13. > > Do we know why there's such a huge performance regression? Can we improve > this somehow with some flag tweaking? It would be great if we see a more in > depth explanation of the gains vs losses of upgrading. > > On Wed, Aug 4, 2021 at 3:08 PM S

[ANNOUNCE] RocksDB Version Upgrade and Performance

2021-08-04 Thread Stephan Ewen
Hi all! *!!! If you are a big user of the Embedded RocksDB State Backend and have performance sensitive workloads, please read this !!!* I want to quickly raise some awareness for a RocksDB version upgrade we plan to do, and some possible impact on application performance. *We plan to upgrade

Re: Bloom Filter - RocksDB - LinkageError Classloading

2021-08-04 Thread Stephan Ewen
@Yun Tang Does it make sense to add RocksDB to the "always parent-first options" to avoid these kinds of errors when users package apps incorrectly? My feeling is that these packaging errors occur very frequently. On Wed, Aug 4, 2021 at 10:41 AM Yun Tang wrote: > Hi Sandeep, > > Did you include

Re: Recommended way to submit a SQL job via code without getting tied to a Flink version?

2021-06-30 Thread Stephan Ewen
Hi Sonam! To answer this, let me quickly provide some background on the two ways Flink deployments / job submissions work. See also here for some background: https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/overview/#deployment-modes What is common in all setups is

Re: [Statefun] Truncated Messages in Python workers

2021-05-20 Thread Stephan Ewen
Thanks for reporting this, it looks indeed like a potential bug. I filed this Jira for it: https://issues.apache.org/jira/browse/FLINK-22729 Could you share (here or in Jira) what the stack on the Python Worker side is (for example which HTTP server)? Do you know if the message truncation

Re: Task Local Recovery with mountable disks in the cloud

2021-04-21 Thread Stephan Ewen
/cc dev@flink On Tue, Apr 20, 2021 at 1:29 AM Sonam Mandal wrote: > Hello, > > We've been experimenting with Task-local recovery using Kubernetes. We > have a way to specify mounting the same disk across Task Manager > restarts/deletions for when the pods get recreated. In this scenario, we >

Re: Re: Re: [DISCUSSION] Introduce a separated memory pool for the TM merge shuffle

2021-03-22 Thread Stephan Ewen
> From: Guowei Ma > Date: 2021-03-09 17:28:35 > To: 曹英杰(北牧) > Cc: Till Rohrmann; Stephan Ewen; > dev; user; Xintong Song< > tonysong...@gmail.com> > Subject: Re: Re: [DISCUSSION] Introduce a separated memory pool for the TM > merge shuffle > > Hi, all

Re: [DISCUSSION] Introduce a separated memory pool for the TM merge shuffle

2021-03-05 Thread Stephan Ewen
Thanks Guowei, for the proposal. As discussed offline already, I think this sounds good. One thought is that 16m sounds very small for a default read buffer pool. How risky do you think it is to increase this to 32m or 64m? Best, Stephan On Fri, Mar 5, 2021 at 4:33 AM Guowei Ma wrote: > Hi,

Re: [DISCUSS] Remove flink-connector-filesystem module.

2020-10-28 Thread Stephan Ewen
+1 to remove the Bucketing Sink. It has been very common in the past to remove code that was deprecated for multiple releases in favor of reducing baggage. Also in cases that had no perfect drop-in replacement, but needed users to forward fit the code. I am not sure I understand why this case is

Re: [DISCUSS] FLIP-144: Native Kubernetes HA for Flink

2020-09-16 Thread Stephan Ewen
This is a very cool feature proposal. One lesson-learned from the ZooKeeper-based HA is that it is overly complicated to have the Leader RPC address in a different node than the LeaderLock. There is extra code needed to make sure these converge and they can be temporarily out of sync. A much

Re: Performance issue associated with managed RocksDB memory

2020-09-09 Thread Stephan Ewen
Hey Juha! I agree that we cannot reasonably expect the majority of users to understand block sizes, arena sizes, etc. to get their application running. So the default should be "inform when there is a problem and suggest to use more memory." Block/arena size tuning is for the absolute

Re: Performance issue associated with managed RocksDB memory

2020-09-09 Thread Stephan Ewen
Thanks for driving this, it is a great find and a nice proposal for a solution. I generally really like the idea of the block size sanity checker. I would also suggest first logging a big fat WARNING rather than crashing the job. Crashing the job like this would be an unrecoverable

Re: [DISCUSS] Removing deprecated methods from DataStream API

2020-08-20 Thread Stephan Ewen
We have removed some public methods in the past, after a careful deprecation period, if they no longer worked well. The sentiment I got from users is that careful cleanup is in fact appreciated, otherwise things get confusing over time (the deprecated methods cause noise in the API).

Re: Any way to increase sort buffer size?

2020-05-04 Thread Stephan Ewen
Posting an update here, because it came up again: Have a look at https://issues.apache.org/jira/browse/FLINK-17192 specifically this comment: > There is a hidden/experimental feature in the sorter to offload large records, but it is not active by default. > > You can try and add

Re: Flink 1.10 Out of memory

2020-04-26 Thread Stephan Ewen
ack overflow is a partner program with Apache >> frameworks . How did we develop software before google or StackOverFlow or >> mailing lists ? >> >> I would have to say it was with comprehensive product documents and >> makeuse.org of software development tools. Mainly an u

Re: Flink 1.10 Out of memory

2020-04-24 Thread Stephan Ewen
>> error message "OutOfMemoryError: Direct buffer memory". >> >> Thank you~ >> >> Xintong Song >> >> >> >> On Thu, Apr 23, 2020 at 6:11 PM Stephan Ewen wrote: >> >>> @Xintong and @Lasse could it be that the JVM hits the

Re: Flink 1.10 Out of memory

2020-04-24 Thread Stephan Ewen
memory from the OS. Given > the exception is thrown from calling a native method, I think the problem > is that not enough native memory can be allocated for executing the native > method. > > Thank you~ > > Xintong Song > > > > On Fri, Apr 24, 2020 at 3:40 PM Steph

Re: Flink 1.10 Out of memory

2020-04-24 Thread Stephan Ewen
ld see the > error message "OutOfMemoryError: Direct buffer memory". > > Thank you~ > > Xintong Song > > > > On Thu, Apr 23, 2020 at 6:11 PM Stephan Ewen wrote: > >> @Xintong and @Lasse could it be that the JVM hits the "Direct Memory" >> l

Re: Streaming Job eventually begins failing during checkpointing

2020-04-23 Thread Stephan Ewen
If something requires Beam to register a new state each time, then this is tricky, because currently you cannot unregister states from Flink. @Yu @Yun I remember chatting about this (allowing to explicitly unregister states so they get dropped from successive checkpoints) at some point, but I

Re: Flink 1.10 Out of memory

2020-04-23 Thread Stephan Ewen
@Xintong and @Lasse could it be that the JVM hits the "Direct Memory" limit here? Would increasing the "taskmanager.memory.framework.off-heap.size" help? On Mon, Apr 20, 2020 at 11:02 AM Zahid Rahman wrote: > As you can see from the task manager tab of flink web dashboard > > Physical

Re: Building with Hadoop 3

2020-03-04 Thread Stephan Ewen
Have you tried to just export Hadoop 3's classpath to `HADOOP_CLASSPATH` and see if that works out of the box? If the main use case is HDFS access, then there is a fair chance it might just work, because Flink uses only a small subset of the Hadoop FS API which is stable between 2.x and 3.x, as

Re: [DISCUSS] Drop Savepoint Compatibility with Flink 1.2

2020-02-20 Thread Stephan Ewen
ility. >> >> I personally think that it is very important for a project to keep a good >> pace in developing that old legacy stuff must be dropped from time to time. >> As long as there is an upgrade routine (via going to another flink release) >> that's fine

[DISCUSS] Drop Savepoint Compatibility with Flink 1.2

2020-02-20 Thread Stephan Ewen
Hi all! For some cleanup and simplifications, it would be helpful to drop Savepoint compatibility with Flink version 1.2. That version was released almost three years ago. I would expect that no one uses that old version any more in a way that they actively want to upgrade directly to 1.11.

Re: Not able to consume kafka massages in flink-1.10.0 version

2020-02-18 Thread Stephan Ewen
This looks like a Kafka version mismatch. Please check that you have the right Flink connector and not any other Kafka dependencies in the classpath. On Tue, Feb 18, 2020 at 10:46 AM Avinash Tripathy < avinash.tripa...@stellapps.com> wrote: > Hi, > > I am getting this error message. > >

Re: Using retained checkpoints as savepoints

2020-02-18 Thread Stephan Ewen
Maybe one small addition: - for the heap state backend, there is no difference at all between the format and behavior of retained checkpoints (after the job is canceled) and savepoints. Same format and features. - For RocksDB incremental checkpoints, we do in fact support re-scaling, and I

Re: [ANNOUNCE] Apache Flink 1.10.0 released

2020-02-12 Thread Stephan Ewen
Congrats to us all. A big piece of work, nicely done. Let's hope that this helps our users make their existing use cases easier and also opens up new use cases. On Wed, Feb 12, 2020 at 3:31 PM 张光辉 wrote: > Great work. > > On Wed, Feb 12, 2020 at 10:11 PM Congxian Qiu wrote: > >> Great work. >> Thanks

[DISCUSS] Change default for RocksDB timers: Java Heap => in RocksDB

2020-01-16 Thread Stephan Ewen
Hi all! I would suggest a change of the current default for timers. A bit of background: - Timers (for windows, process functions, etc.) are state that is managed and checkpointed as well. - When using the MemoryStateBackend and the FsStateBackend, timers are kept on the JVM heap, like

[DISCUSS] Drop RequiredParameters and OptionType

2019-12-03 Thread Stephan Ewen
I just stumbled across these classes recently and was looking for sample uses. No examples and other tests in the code base seem to use RequiredParameters and OptionType. They also seem quite redundant with how ParameterTool itself works (tool.getRequired()). Should we drop them, in an attempt

[DISCUSS] Make Managed Memory always off-heap (Adjustment to FLIP-49)

2019-11-26 Thread Stephan Ewen
Hi all! Yesterday, some of the people involved in FLIP-49 had a long discussion about managed memory in Flink. Particularly, the fact that we have managed memory either on heap or off heap and that FLIP-49 introduced having both of these types of memory at the same time. ==> What we want to

Re: [DISCUSS] Support configure remote flink jar

2019-11-19 Thread Stephan Ewen
Would that be a feature specific to Yarn? (and maybe standalone sessions) For containerized setups, an init container seems like a nice way to solve this. Also more flexible, when it comes to supporting authentication mechanisms for the target storage system, etc. On Tue, Nov 19, 2019 at 5:29

Re: [PROPOSAL] Contribute Stateful Functions to Apache Flink

2019-10-14 Thread Stephan Ewen
ws[2]. In the end the suggestion is to adopt an existing code >>>>>>> base as is. It also proposes a new programs concept that could result >>>>>>> in a >>>>>>> shift of priorities for the community in a long run.

Re: Flink 1.9, MapR secure cluster, high availability

2019-09-19 Thread Stephan Ewen
Hi! Not sure what is happening here. - I cannot understand why MapR FS should use Flink's relocated ZK dependency - It might be that it doesn't and that all the logging we see probably comes from Flink's HA services. Maybe the MapR stuff uses a different logging framework and the logs do not

Re: [DISCUSS] Drop older versions of Kafka Connectors (0.9, 0.10) for Flink 1.10

2019-09-16 Thread Stephan Ewen
Qin > > > On Wed, Sep 11, 2019 at 4:26 PM Wesley Peng wrote: > >> >> >> on 2019/9/11 16:17, Stephan Ewen wrote: >> > We still maintain connectors for Kafka 0.8 and 0.9 in Flink. >> > I would suggest to drop those with Flink 1.10 and start supporting only >

Re: SIGSEGV error

2019-09-13 Thread Stephan Ewen
Given that the segfault happens in the JVM's ZIP stream code, I am curious if this is a bug in Flink or in the JVM core libs that happens to be triggered now by newer versions of Flink. I found this on StackOverflow, which looks like it could be related:

[DISCUSS] Drop older versions of Kafka Connectors (0.9, 0.10) for Flink 1.10

2019-09-11 Thread Stephan Ewen
Hi all! We still maintain connectors for Kafka 0.8 and 0.9 in Flink. I would suggest to drop those with Flink 1.10 and start supporting only Kafka 0.10 onwards. Are there any concerns about this, or still a significant number of users of these versions? Best, Stephan

Re: Flink 1.9, MapR secure cluster, high availability

2019-08-29 Thread Stephan Ewen
Hi Maxim! The change of the MapR dependency should not have an impact on that. Do you know if the same thing worked in prior Flink versions? Is that a regression in 1.9? The exception that you report, is that from Flink's HA services trying to connect to ZK, or from the MapR FS client trying to

Re: Stale watermark due to unconsumed Kafka partitions

2019-08-19 Thread Stephan Ewen
You can use the Timestamp Assigner / Watermark Generator in two different ways: Per Kafka Partition or per parallel source. I would usually recommend per Kafka Partition, because if the read position in the partitions drifts apart (for example some partitions are read at the tail, some are read a
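
To illustrate the drift problem described above, here is a minimal sketch in plain Java. It is a toy model, not Flink's implementation: the class and method names are made up for this example, and it assumes a generator whose watermark is the highest timestamp seen so far minus a fixed out-of-orderness allowance. With one generator per parallel source, the partition that is read at the tail pushes the watermark far ahead; with one generator per partition, the source forwards the minimum across its partitions.

```java
/** Toy model of watermark tracking. Hypothetical names; not Flink's actual code. */
final class PartitionWatermarkSketch {

    /** Highest timestamp in a non-empty array of record timestamps. */
    static long maxTimestamp(long[] timestamps) {
        long max = timestamps[0];
        for (long ts : timestamps) {
            max = Math.max(max, ts);
        }
        return max;
    }

    /**
     * One generator per parallel source: it sees the records of all
     * partitions interleaved, so it follows the overall highest timestamp.
     */
    static long perSourceWatermark(long[][] partitions, long maxOutOfOrderness) {
        long max = Long.MIN_VALUE;
        for (long[] partition : partitions) {
            max = Math.max(max, maxTimestamp(partition));
        }
        return max - maxOutOfOrderness;
    }

    /**
     * One generator per partition: each partition has its own watermark,
     * and the source emits the minimum of them.
     */
    static long perPartitionWatermark(long[][] partitions, long maxOutOfOrderness) {
        long min = Long.MAX_VALUE;
        for (long[] partition : partitions) {
            min = Math.min(min, maxTimestamp(partition) - maxOutOfOrderness);
        }
        return min;
    }

    public static void main(String[] args) {
        long[][] partitions = {
            {1000L, 1010L, 1020L}, // partition read at the tail
            {100L, 110L, 120L}     // partition that lags behind
        };
        System.out.println("per source:    " + perSourceWatermark(partitions, 5L));
        System.out.println("per partition: " + perPartitionWatermark(partitions, 5L));
    }
}
```

With the sample data in `main`, the per-source watermark is 1015 and the per-partition watermark is 115. A following record with timestamp 130 from the lagging partition would therefore count as late only in the per-source case.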

Re: Using S3 as a sink (StreamingFileSink)

2019-08-18 Thread Stephan Ewen
My first guess would also be the same as Rafi's: The lifetime of the MPU part files is too low for that use case. Maybe this can help: - If you want to stop a job with a savepoint and plan to restore later from it (possibly much later, so that the MPU Part lifetime might be exceeded), then

Re: Apache flink 1.7.2 security issues

2019-08-13 Thread Stephan Ewen
Hi! Thank you for reporting this! At the moment, the Flink REST endpoint is not secure in the way that you can expose it publicly. After all, you can submit Flink jobs to it which by definition support executing arbitrary code. Given that access to the REST endpoint allows by design arbitrary

Re: Batch mode with Flink 1.8 unstable?

2019-06-26 Thread Stephan Ewen
Hi Ken! Sorry to hear you are going through this experience. The major focus on streaming so far means that the DataSet API has stability issues at scale. So, yes, batch mode in the current Flink version can be somewhat tricky. It is a big focus of Flink 1.9 to fix the batch mode, finally, and by

Re: [DISCUSS] Deprecate previous Python APIs

2019-06-14 Thread Stephan Ewen
(or figure out > how to use the jars yourself), but it is still feasible and can be > documented in the release notes. > > On 11/06/2019 15:30, Stephan Ewen wrote: > > Hi all! > > > > I would suggest to deprecating the existing python APIs for DataSet and > >

[DISCUSS] Deprecate previous Python APIs

2019-06-11 Thread Stephan Ewen
Hi all! I would suggest deprecating the existing Python APIs for DataSet and DataStream API with the 1.9 release. Background is that there is a new Python API under development. The new Python API is initially against the Table API. Flink 1.9 will support Table API programs without UDFs, 1.10

Re: [DISCUSS] Temporarily remove support for job rescaling via CLI action "modify"

2019-04-24 Thread Stephan Ewen
Sounds reasonable to me. If it is a broken feature, then there is not much value in it. On Tue, Apr 23, 2019 at 7:50 PM Gary Yao wrote: > Hi all, > > As the subject states, I am proposing to temporarily remove support for > changing the parallelism of a job via the following syntax [1]: > >

Re: [DISCUSS] Drop Elasticssearch 1 connector

2019-04-05 Thread Stephan Ewen
+1 to drop it Previously released versions are still available and compatible with newer Flink versions anyways. On Fri, Apr 5, 2019 at 2:12 PM Bowen Li wrote: > +1 for dropping elasticsearch 1 connector. > > On Wed, Apr 3, 2019 at 5:10 AM Chesnay Schepler > wrote: > >> Hello everyone, >> >>

Re: [DISCUSS] Remove forceAvro() and forceKryo() from the ExecutionConfig

2019-03-26 Thread Stephan Ewen
ocumentation [1]. For this, Kryo needs to be used for POJO types as > well, if I am not mistaken. > > Cheers, > > Konstantin > > [1] > > https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/custom_serializers.html > > > On Tue, Mar 26, 2019 at 10:03 AM

Re: [DISCUSS] Remove forceAvro() and forceKryo() from the ExecutionConfig

2019-03-26 Thread Stephan Ewen
d using Kryo. > > Best > Yun Tang > ------ > *From:* Stephan Ewen > *Sent:* Tuesday, March 26, 2019 2:31 > *To:* dev; user > *Subject:* [DISCUSS] Remove forceAvro() and forceKryo() from the > ExecutionConfig > > Hi all! > > The Executio

[DISCUSS] Remove forceAvro() and forceKryo() from the ExecutionConfig

2019-03-25 Thread Stephan Ewen
Hi all! The ExecutionConfig has some very old settings: forceAvro() and forceKryo(), which are actually misleadingly named. They cause POJOs to use Avro or Kryo rather than the POJO serializer. I think we do not have a good case any more to use Avro for POJOs. POJOs that are also Avro types go

Re: Flink performance drops when async checkpoint is slow

2019-03-20 Thread Stephan Ewen
Hi Paul! One issue could be that in Flink 1.5.x state snapshots are asynchronous, but timers are synchronous. Timers are asynchronous starting from Flink 1.6. Best, Stephan On Fri, Mar 1, 2019 at 4:03 AM zhijiang wrote: > Hi Paul, > > Thanks for your feedback. If the at-least-once mode still

Re: [DISCUSS] Adding a mid-term roadmap to the Flink website

2019-02-26 Thread Stephan Ewen
g & Jark gave some very valuable >> feedbacks and suggestions and I think we can definitely move the >> conversation forward to reach a more concrete doc first before we put in to >> the roadmap. Thanks for reviewing it and driving the roadmap effort! >> >> -- >>

Re: FLIP-16, FLIP-15 Status Updates?

2019-02-21 Thread Stephan Ewen
Hi John! I know some committers are working on iterations, but on a bigger update. That might subsume the FLIPs 15 and 16 eventually. I believe they will share some part of that soon (in a few weeks). Best, Stephan On Tue, Feb 19, 2019 at 5:45 PM John Tipper wrote: > Hi Timo, > > That’s

Re: [DISCUSS] Adding a mid-term roadmap to the Flink website

2019-02-21 Thread Stephan Ewen
discussion: > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Change-underlying-Frontend-Architecture-for-Flink-Web-Dashboard-td24902.html > > > What do you think? > > Regards, > Shaoxuan > > > > On Wed, Feb 13, 2019 at 7:21 PM Stephan Ewen wrote

Re: [DISCUSS] Adding a mid-term roadmap to the Flink website

2019-02-21 Thread Stephan Ewen
docs-release-1.7/ > [5] https://flink.apache.org/ > > On Thu, Feb 14, 2019 at 2:26 AM Stephan Ewen wrote: > >> I think the website is better as well. >> >> I agree with Fabian that the wiki is not so visible, and visibility is >> the main motivation. >> Thi

Re: [DISCUSS] Adding a mid-term roadmap to the Flink website

2019-02-14 Thread Stephan Ewen
And I guess we may need to update the roadmap very often at the >> beginning as there's so many discussions and proposals in community >> recently. We can move it into flink web site later when we feel it could be >> nailed down. >> >> On Thu, Feb 14, 2019, in the afternoon at 5:

Re: [DISCUSS] Adding a mid-term roadmap to the Flink website

2019-02-14 Thread Stephan Ewen
>> >> - Same window operators on bounded/unbounded Table API and DataStream API >> (currently OVER window only exists in SQL/TableAPI, DataStream API does >> not yet support) >> >> Best, >> Jincheng >> >> On Wed, Feb 13, 2019 at 7:21 PM Stephan Ewen wrote:

[DISCUSS] Adding a mid-term roadmap to the Flink website

2019-02-13 Thread Stephan Ewen
Hi all! Recently several contributors, committers, and users asked about making it more visible in which way the project is currently going. Users and developers can track the direction by following the discussion threads and JIRA, but due to the mass of discussions and open issues, it is very

Re: [DISCUSS] Towards a leaner flink-dist

2019-01-23 Thread Stephan Ewen
There are some points where a leaner approach could help. There are many libraries and connectors that are currently being adding to Flink, which makes the "include all" approach not completely feasible in long run: - Connectors: For a proper experience with the Shell/CLI (for example for SQL)

Re: EXT :Re: StreamingFileSink cannot get AWS S3 credentials

2019-01-17 Thread Stephan Ewen
Regarding configurations: According to the code [1] , all config keys starting with "s3", "s3a" and "fs.s3a" are forwarded from the

Re: [ANNOUNCE] Apache Flink 1.6.3 released

2018-12-31 Thread Stephan Ewen
Great work, thank you for volunteering to manage this release, Gordon! On Tue, Dec 25, 2018 at 3:30 AM jincheng sun wrote: > Thanks a lot for being our release manager Gordon. > Thanks a lot for made this release possible! > > Cheers, > Jincheng > > On Sun, Dec 23, 2018 at 9:35 PM Tzu-Li (Gordon) Tai wrote:

Re: [ANNOUNCE] Apache Flink 1.7.1 released

2018-12-31 Thread Stephan Ewen
Thank you very much for managing this bugfix release, Chesnay! On Tue, Dec 25, 2018 at 3:36 AM jincheng sun wrote: > Thanks for being the release manager Chesnay! > Thanks a lot for made this release possible! > > Thanks, > Jincheng > > On Sun, Dec 23, 2018 at 3:34 AM Chesnay Schepler wrote: > >> The Apache

Re: [ANNOUNCE] Apache Flink 1.5.6 released

2018-12-31 Thread Stephan Ewen
Thank you to all who contributed, and, of course, especially to the release manager! On Fri, Dec 28, 2018 at 10:12 AM Till Rohrmann wrote: > Thanks a lot for being our release manager Thomas. Great work! Also thanks > to the community for making this release possible. > > Cheers, > Till > > On

Re: S3 connector Hadoop class mismatch

2018-09-23 Thread Stephan Ewen
I have to stick to the old > bucketing sink. I made it by explicitly setting Hadoop conf for the > bucketing sink in the user code. > > Thank you very much! > > Best, > Paul Lam > > > On Fri, Sep 21, 2018 at 6:30 PM Stephan Ewen wrote: > >> Hi! >> >> The old bucketin

Re: S3 connector Hadoop class mismatch

2018-09-20 Thread Stephan Ewen
Hi! A few questions to diagnose/fix this: Do you explicitly configure the "hadoop.security.group.mapping"? - If not, this setting may have leaked in from a Hadoop config in the classpath. We are fixing this in Flink 1.7, to make this insensitive to such settings leaking in. - If yes, then

Re: CompletedCheckpoints are getting Stale ( Flink 1.4.2 )

2018-09-03 Thread Stephan Ewen
cases where that job doesn't run again therefore the >> completedCheckpoint gets staled. Is this something that could happen? >> >> Is there anyway to check by logging wether the job gets to Global Final >> State before we tear down the cluster? >> >> Cheers, >

Re: CompletedCheckpoints are getting Stale ( Flink 1.4.2 )

2018-08-31 Thread Stephan Ewen
Hi Laura! Vino had good pointers. There really should be no case in which this is not cleaned up. Is this a bounded job that ends? Is it always the last of the bounded job's checkpoints that remains? Best, Stephan On Fri, Aug 31, 2018 at 5:02 AM, vino yang wrote: > Hi Laura, > > First of

[DISCUSS] Remove the slides under "Community & Project Info"

2018-08-26 Thread Stephan Ewen
Hi all! In the past, we collected slide sets under the "Community & Project Info" site. I would like to see what the community thinks about removing them. There are currently several issues: - The list is not well maintained. There are for example no 2018 slides at all. - Many slide sets

Re: Taskmanager SSL fails looking for Subject Alternative IP Address

2018-07-13 Thread Stephan Ewen
Thanks for reporting this. Given that hostname verification seems to be the issue, I would assume that the TaskManager somehow advertises a hostname in a form that is incompatible with the verification in some setups. While it would be interesting to dig deeper into why this happens, I think we

Re: Flink job hangs using rocksDb as backend

2018-07-11 Thread Stephan Ewen
Hi shishal! I think there is an issue with cancellation when many timers fire at the same time. These timers have to finish before shutdown happens, this seems to take a while in your case. Did the TM process actually kill itself in the end (and got restarted)? On Wed, Jul 11, 2018 at 9:29

Re: [DISCUSS] Flink 1.6 features

2018-07-04 Thread Stephan Ewen
for this https://issues.apache. > org/jira/browse/FLINK-9609 > > Cheers > Minglei > > On Jun 4, 2018, at 5:21 PM, Stephan Ewen wrote: > > Hi Flink Community! > > The release of Apache Flink 1.5 has happened (yay!) - so it is a good time > to start talking about what to do for r

Re: help understand/debug high memory footprint on jobmanager

2018-06-29 Thread Stephan Ewen
Just saw Stefan's response, it is basically the same. We either null out the field on deploy or archival. On deploy would be even more memory friendly. @Steven - can you open a JIRA ticket for this? On Fri, Jun 29, 2018 at 9:08 PM, Stephan Ewen wrote: > The problem se

Re: help understand/debug high memory footprint on jobmanager

2018-06-29 Thread Stephan Ewen
The problem seems to be that the Executions that are kept for history (mainly metrics / web UI) still hold a reference to their TaskStateSnapshot. Upon archival, that field needs to be cleared for GC. This is quite clearly a bug... On Fri, Jun 29, 2018 at 11:29 AM, Stefan Richter <
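
The fix described here (clearing the state reference when an execution is archived) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: `StateSnapshot` and `Execution` below are hypothetical stand-ins written for this example, not Flink's real classes, and the actual field and method names in Flink may differ.

```java
/** Illustration of releasing a large reference on archival. Hypothetical classes. */
final class ArchivalSketch {

    /** Hypothetical stand-in for a potentially large per-task state handle. */
    static final class StateSnapshot {
        final byte[] payload;

        StateSnapshot(int sizeInBytes) {
            this.payload = new byte[sizeInBytes];
        }
    }

    /** Hypothetical stand-in for an execution attempt that is kept for history. */
    static final class Execution {
        private final String id;
        private String finalStatus;
        private StateSnapshot restoredState; // only needed while the task is live

        Execution(String id, StateSnapshot restoredState) {
            this.id = id;
            this.restoredState = restoredState;
        }

        boolean holdsState() {
            return restoredState != null;
        }

        String finalStatus() {
            return finalStatus;
        }

        /**
         * Keeps the small metadata for metrics / web UI history, but drops
         * the reference to the snapshot so the garbage collector can reclaim it.
         */
        void archive(String status) {
            this.finalStatus = status;
            this.restoredState = null;
        }
    }

    public static void main(String[] args) {
        Execution execution = new Execution("attempt-1", new StateSnapshot(1024));
        System.out.println("before archival holds state: " + execution.holdsState());
        execution.archive("FINISHED");
        System.out.println("after archival holds state:  " + execution.holdsState());
    }
}
```

The same idea applies if the reference is cleared earlier, at deployment time, which would free the memory sooner.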

Re: [DISCUSS] Flink 1.6 features

2018-06-08 Thread Stephan Ewen
FLINK-7129 (Support dynamically changing CEP patterns) be included in > 1.6? There were discussions about possibly including it in 1.6: > http://mail-archives.apache.org/mod_mbox/flink-user/201803. > mbox/%3cCAMq=OU7gru2O9JtoWXn1Lc1F7NKcxAyN6A3e58kxctb4b508RQ@ > mail.gmail.com%3e >

Re: Output batch to Kafka

2018-06-05 Thread Stephan Ewen
You could go with Chesnay's suggestion, which might be the quickest fix. Creating a KafkaOutputFormat (possibly wrapping the KafkaProducer) would be a bit cleaner. Would be happy to have that as a contribution, actually ;-) If you care about producing "exactly once" using Kafka Transactions

[DISCUSS] Flink 1.6 features

2018-06-04 Thread Stephan Ewen
Hi Flink Community! The release of Apache Flink 1.5 has happened (yay!) - so it is a good time to start talking about what to do for release 1.6. *== Suggested release timeline ==* I would propose to release around *end of July* (that is 8-9 weeks from now). The rationale behind that: There was

Re: S3 for state backend in Flink 1.4.0

2018-06-01 Thread Stephan Ewen
A heads up on this front: - For state backends during checkpointing, I would suggest to use the flink-s3-fs-presto, which is quite a bit faster than the flink-s3-fs-hadoop by avoiding a bunch of unnecessary metadata operations. - We have started work on re-writing the Bucketing Sink to make

Re: Multiple hdfs

2018-05-23 Thread Stephan Ewen
I think that Hadoop recommends to solve such setups with a viewfs:// that spans both HDFS clusters and then the two different clusters look like different paths within one file system. Similar to mounting different file systems into one directory tree in unix. On Tue, May 22, 2018 at 4:41 PM, Kien
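
The mounting analogy can be sketched in plain Java. This is a toy mount table, not Hadoop's ViewFs implementation: the class name, the mount points `/clusterA` and `/clusterB`, and the NameNode URIs are all assumptions made up for this example. It only shows the idea that a path prefix selects which underlying cluster a path belongs to.

```java
import java.util.TreeMap;

/** Toy mount table that maps path prefixes to clusters. Hypothetical names. */
final class MountTableSketch {

    private final TreeMap<String, String> links = new TreeMap<>();

    /** Registers a mount point, e.g. "/clusterA" -> "hdfs://namenode-a:8020". */
    void link(String mountPoint, String targetUri) {
        links.put(mountPoint, targetUri);
    }

    /** Resolves a path in the unified tree to a URI on the matching cluster. */
    String resolve(String path) {
        String best = null;
        for (String mountPoint : links.keySet()) {
            boolean matches = path.equals(mountPoint) || path.startsWith(mountPoint + "/");
            // Prefer the longest matching mount point.
            if (matches && (best == null || mountPoint.length() > best.length())) {
                best = mountPoint;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("No mount point for " + path);
        }
        return links.get(best) + path.substring(best.length());
    }

    public static void main(String[] args) {
        MountTableSketch table = new MountTableSketch();
        table.link("/clusterA", "hdfs://namenode-a:8020");
        table.link("/clusterB", "hdfs://namenode-b:8020");
        System.out.println(table.resolve("/clusterA/data/events"));
        System.out.println(table.resolve("/clusterB/checkpoints"));
    }
}
```

A job would then address both clusters through one file system scheme and differ only in the path prefix.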

Re: Lost JobManager

2018-05-11 Thread Stephan Ewen
Hi! This somehow looks like the YARN shutdown call overtakes the call that marks the Job as successful. That should not be the case. Looking at the logs, you are running a Java 7 installation, so I assume this must be some Flink 1.3.x based build. Can you let us know which version this is based

Re: Delay in Flink timers

2018-05-11 Thread Stephan Ewen
Checkpoints are largely asynchronous, but the checkpointing of timers has some synchronous component (which we are currently working on getting rid of). So when you have a lot of timers, streams stall for a short time while the timers are checkpointed. If all goes as planned, Flink 1.6 will not

Re: Batch job stuck in Canceled state in Flink 1.5

2018-05-03 Thread Stephan Ewen
; Thanks, > Amit > > On Thu, May 3, 2018 at 12:58 PM, Stephan Ewen <se...@apache.org> wrote: > > Hi Amit! > > > > Thanks for sharing this, this looks like a regression with the network > stack > > changes. > > > > The log you shared from the Ta

Re: Odd job failure

2018-05-03 Thread Stephan Ewen
Concerning the connectivity issue - it is hard to say anything more without any logs or details. Does the JM log that it is trying to send tasks to the 3rd TM, but the TM does not show signs of executing them? On Thu, May 3, 2018 at 10:22 AM, Stephan Ewen <se...@apache.org> wrote: >

Re: Odd job failure

2018-05-03 Thread Stephan Ewen
Hi Elias! Concerning the spilling of alignment data to disk: - In 1.4.x , you can set an upper limit via " task.checkpoint.alignment.max-size ". See [1]. - In 1.5.x, the default is a back-pressure based alignment, which does not spill any more. Best, Stephan [1]

Re: Batch job stuck in Canceled state in Flink 1.5

2018-05-03 Thread Stephan Ewen
Hi Amit! Thanks for sharing this, this looks like a regression with the network stack changes. The log you shared from the TaskManager gives some hint, but that exception alone should not be a problem. That exception can occur under a race between deployment of some tasks while the whole job is

Re: A trivial update on README

2018-04-28 Thread Stephan Ewen
Hi! Thank you for pinging us on this one - I left a comment, after that I would merge the PR. The Flink community had a bit of a slow phase for PRs, with many committers being at the Flink Forward Conference, being involved in the 1.5 release testing. We plan to speed up reviews again now.

Re: DFS problem with removing checkpoint

2018-04-22 Thread Stephan Ewen
"completedCheckpoint" are growing and also > dirs in "state.backend.fs.checkpointdir/JobId/check-". > > In my case i have Windows DFS filesystem mounted on linux with cifs > protocol. > > Can you give me a hint or description which process is responsible for > removing

Re: Flink/Kafka POC performance issue

2018-04-17 Thread Stephan Ewen
A few ideas how to start debugging this: - Try deactivating checkpoints. Without that, no work goes into persisting rocksdb data to the checkpoint store. - Try to swap RocksDB for the FsStateBackend - that reduces serialization cost for moving data between heap and offheap (rocksdb). - Do

Re: InterruptedException when async function is cancelled

2018-04-17 Thread Stephan Ewen
Agreed. It is fixed in 1.5 and in the 1.4.x branch. The fix came after 1.4.2, so it is not released as of now. On Tue, Apr 17, 2018 at 7:47 PM, Ken Krugler wrote: > Hi Timo, > > [Resending from an address the Apache list server likes…] > > I discussed this with Till

Re: CaseClassSerializer and/or TraversableSerializer may still not be threadsafe?

2018-04-17 Thread Stephan Ewen
Thanks for reporting this, also thanks for checking out that this works with RocksDB and also with synchronous checkpoints. I would assume that this issue lies not in the serializer itself, but in accidental sharing in the FsStateBackend async snapshots. Do you know if the issue still exists in

Re: Task Manager fault tolerance does not work

2018-04-03 Thread Stephan Ewen
Please make sure you have set a number of re-tries and have checkpointing activated if you use streaming. On Fri, Mar 30, 2018 at 1:59 PM, dhirajpraj wrote: > HI, > I have set up a flink 1.4 cluster with 1 job manager and two task managers. > The configs

Re: Unable to load AWS credentials: Flink 1.2.1 + S3 + Kubernetes

2018-04-03 Thread Stephan Ewen
leSystem: > - hadoop-aws-2.7.2.jar > - aws-java-sdk-1.7.4.jar > - httpcore-4.2.5.jar > - httpclient-4.2.5.jar > > Am I missing some dependencies here? > > Any suggestions on troubleshooting the issue? > > > > @Stephan Ewen <se...@apache.org>

Re: DFS problem with removing checkpoint

2018-04-02 Thread Stephan Ewen
completed, previous file and dir are deleted, and this is ok, because i > always have only one checkpoint. > But in my case when next checkpoint is completed, the previous is not > deleted and this happens when job is in running state. > > Maybe you know why those files/dirs ar

Subject: Last chance to register for Flink Forward SF (April 10). Get 25% discount

2018-03-29 Thread Stephan Ewen
Hi all! There are still some spots left to attend Flink Forward San Francisco, so sign up soon before registration closes. Use this promo code to get 25% off: MailingListFFSF The 1-day conference takes place on Tuesday, April 10 in downtown SF. We have a great lineup of speakers from companies

Re: Unable to load AWS credentials: Flink 1.2.1 + S3 + Kubernetes

2018-03-29 Thread Stephan Ewen
Using AWS credentials with Kubernetes is not trivial. Have you looked at AWS / Kubernetes docs and projects like https://github.com/jtblin/kube2iam which bridge between containers and AWS credentials? Also, Flink 1.2.1 is quite old, you may want to try a newer version. 1.4.x has a bit of an

Re: Incremental checkpointing performance

2018-03-29 Thread Stephan Ewen
return false; > >> } > >> }).keyBy(1).countWindow(WINDOWLENGTH, > SLIDELENGTH).sum(2); > >> incrementStream2.writeUsingOutputFormat(new > >> DiscardingOutputFormat<Tuple4<String, Float, Integer, > >> String>>()); > >> > >> try { > >> env.execute("Flink application."); > >> } catch (Exception e) { > >> logger.error("Error in starting the Flink stream > >> application: " + e.getMessage(), e); > >> } > >> } > >> } > >> > >> > > //-- > --- > >> > >> I have attached two charts (Average_latencies.jpg and > >> Average_state_sizes.jpg) with the results and another image with the > >> Flink dashboard (Flink-Dashboard.png). The average state size chart > >> indicates that the size of an incremental checkpoint is smaller than a > >> full (i.e., complete) checkpoint. This is the expected behavior from any > >> incremental checkpointing system since the incremental checkpoint just > >> stores the delta change. However, the average latency chart indicates > >> that the average latency for taking an incremental checkpoint from Flink > >> is larger than taking a complete (i.e., Full) checkpoint. Is this the > >> expected behavior? I have highlighted the two fields in the > >> "Flink-Dashboard.png" which we used as the fields for the average > >> latency and the average state size. Note that to convert the incremental > >> checkpointing application to full checkpointing application we just > >> commented out the following lines in the above code. > >> > >> try { > >> env.setStateBackend(new RocksDBStateBackend( > >> new > >> FsStateBackend("file:///home/ubuntu/tmp-flink-rocksdb"), > >> true)); > >> } catch (IOException e) { > >> e.printStackTrace(); > >> } > >> > >> Thanks, > >> Miyuru > > -- > Nico Kruber | Software Engineer > data Artisans > > Follow us @dataArtisans > -- > Join Flink Forward - The Apache Flink Conference > Stream Processing | Event Driven | Real Time > -- > Data Artisans GmbH | Stresemannstr. 121A,10963 Berlin, Germany > data Artisans, Inc. 
| 1161 Mission Street, San Francisco, CA-94103, USA > -- > Data Artisans GmbH > Registered at Amtsgericht Charlottenburg: HRB 158244 B > Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen > >

Re: Error running on Hadoop 2.7

2018-03-27 Thread Stephan Ewen
Thanks, in that case it sounds like it is more related to Hadoop classpath mixups, rather than class loading. On Mon, Mar 26, 2018 at 3:03 PM, ashish pok <ashish...@yahoo.com> wrote: > Stephan, we are in 1.4.2. > > Thanks, > > -- Ashish > > On Mon, Mar 26, 2018 at

Re: Error running on Hadoop 2.7

2018-03-26 Thread Stephan Ewen
If you are on Flink 1.4.0 or 1.4.1, please check if you accidentally have Hadoop in your application jar. That can mess up things with child-first classloading. 1.4.2 should handle Hadoop properly in any case. On Sun, Mar 25, 2018 at 3:26 PM, Ashish Pokharel wrote: > Hi
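One way to check whether Hadoop classes ended up in an application jar is to list the jar's entries and look for the `org/apache/hadoop/` package prefix. The following is only a minimal sketch using the JDK's `java.util.jar` API; the class name `HadoopInJarCheck` and its helper methods are hypothetical and are not part of Flink or Hadoop.

```java
import java.io.IOException;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

// Hypothetical helper (not part of Flink): lists Hadoop classes bundled in a jar.
public class HadoopInJarCheck {

    // True if the jar entry name is a class file under the org.apache.hadoop package.
    public static boolean isHadoopClass(String entryName) {
        return entryName.startsWith("org/apache/hadoop/") && entryName.endsWith(".class");
    }

    // Returns the names of all Hadoop class entries found in the given jar file.
    public static List<String> findHadoopClasses(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.stream()
                    .map(JarEntry::getName)
                    .filter(HadoopInJarCheck::isHadoopClass)
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java HadoopInJarCheck <path-to-application-jar>");
            return;
        }
        List<String> hits = findHadoopClasses(args[0]);
        if (hits.isEmpty()) {
            System.out.println("No org.apache.hadoop classes found in the jar.");
        } else {
            System.out.println("Found " + hits.size() + " Hadoop classes, for example: " + hits.get(0));
        }
    }
}
```

If the check reports Hadoop classes, the usual remedy would be to exclude the Hadoop dependency from the packaged jar (for example by marking it as provided in the build); how exactly depends on the build tool in use.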

Re: entrypoint for executing job in task manager

2018-03-21 Thread Stephan Ewen
It would be great to understand a bit more what the exact requirements here are, and what setup you use. I am not a dependency injection expert, so let me know if what I am suggesting here is completely bogus. *(1) Fix set of libraries for Dependency Injection, or dedicated container images per

Re: Is Hadoop 3.0 integration planned?

2018-03-21 Thread Stephan Ewen
That is definitely a good thing to have, would like to have a discussion about how to approach that after 1.5 is released. On Wed, Mar 21, 2018 at 5:39 AM, Jayant Ameta wrote: > > Jayant Ameta >

Re: [ANNOUNCE] Weekly community update #12

2018-03-20 Thread Stephan Ewen
Great initiative, highly appreciated, Till! On Mon, Mar 19, 2018 at 7:06 PM, Till Rohrmann wrote: > Dear community, > > I've noticed that Flink has grown quite a bit in the past. As a > consequence it can be quite challenging to stay up to date. Especially for > community

Re: Strange behavior on filter, group and reduce DataSets

2018-03-20 Thread Stephan Ewen
To diagnose that, can you please check the following: - Change the Person data type to be immutable (final fields, no setters, set fields in constructor instead). Does that make the problem go away? - Change the Person data type to not be a POJO by adding a dummy field that is never used,
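As an illustration of the first check, an immutable data type could look roughly like the sketch below. The original Person class is not shown in the thread, so the fields `name` and `age` and the `withAge` helper are assumptions made purely for this example.

```java
import java.util.Objects;

// Hypothetical immutable variant of the Person type; field names are assumed,
// since the original class is not shown in the thread.
public final class Person {

    private final String name;
    private final int age;

    // All fields are set once in the constructor; there are no setters.
    public Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    public String getName() {
        return name;
    }

    public int getAge() {
        return age;
    }

    // "Modification" returns a new instance instead of mutating this one.
    public Person withAge(int newAge) {
        return new Person(name, newAge);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof Person)) {
            return false;
        }
        Person other = (Person) o;
        return age == other.age && Objects.equals(name, other.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, age);
    }
}
```

Because no instance can change after construction, a user function that "updates" a person has to create a new object, which rules out accidental in-place modification of objects that the runtime may be reusing.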
