Re: Too many open files
Not sure if you have looked at FLINK-8707. FYI.

On Tue, Mar 20, 2018 at 2:13 PM, Govindarajan Srinivasaraghavan <govindragh...@gmail.com> wrote:
> Hi,
>
> We have a streaming job that runs on Flink in Docker, and checkpointing
> happens every 10 seconds. After several starts and cancellations we are
> facing this issue with file handles.
>
> The job reads data from Kafka, processes it, and writes it back to Kafka,
> and we are using the RocksDB state backend. For now we have increased the
> number of file handles to work around the problem, but we would like to
> know whether this is expected or is a bug. Thanks.
>
> java.io.FileNotFoundException: /tmp/flink-io-b3043cd6-50c8-446a-8c25-fade1b1862c0/cb317fc2578db72b3046468948fa00f2f17039b6104e72fb8c58938e5869cfbc.0.buffer (Too many open files)
>     at java.io.RandomAccessFile.open0(Native Method)
>     at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
>     at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
>     at org.apache.flink.streaming.runtime.io.BufferSpiller.createSpillingChannel(BufferSpiller.java:259)
>     at org.apache.flink.streaming.runtime.io.BufferSpiller.<init>(BufferSpiller.java:120)
>     at org.apache.flink.streaming.runtime.io.BarrierBuffer.<init>(BarrierBuffer.java:149)
>     at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.<init>(StreamTwoInputProcessor.java:147)
>     at org.apache.flink.streaming.runtime.tasks.TwoInputStreamTask.init(TwoInputStreamTask.java:79)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:235)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
>     at java.lang.Thread.run(Thread.java:748)
>
> Regards,
> Govind
Too many open files
Hi,

We have a streaming job that runs on Flink in Docker, and checkpointing
happens every 10 seconds. After several starts and cancellations we are
facing this issue with file handles.

The job reads data from Kafka, processes it, and writes it back to Kafka,
and we are using the RocksDB state backend. For now we have increased the
number of file handles to work around the problem, but we would like to
know whether this is expected or is a bug. Thanks.

java.io.FileNotFoundException: /tmp/flink-io-b3043cd6-50c8-446a-8c25-fade1b1862c0/cb317fc2578db72b3046468948fa00f2f17039b6104e72fb8c58938e5869cfbc.0.buffer (Too many open files)
    at java.io.RandomAccessFile.open0(Native Method)
    at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
    at org.apache.flink.streaming.runtime.io.BufferSpiller.createSpillingChannel(BufferSpiller.java:259)
    at org.apache.flink.streaming.runtime.io.BufferSpiller.<init>(BufferSpiller.java:120)
    at org.apache.flink.streaming.runtime.io.BarrierBuffer.<init>(BarrierBuffer.java:149)
    at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.<init>(StreamTwoInputProcessor.java:147)
    at org.apache.flink.streaming.runtime.tasks.TwoInputStreamTask.init(TwoInputStreamTask.java:79)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:235)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
    at java.lang.Thread.run(Thread.java:748)

Regards,
Govind
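Not Flink-specific, but a quick way to check whether a JVM is actually leaking descriptors across restarts, as opposed to simply running with a limit that is too low. This is a Linux-only sketch (it counts entries under /proc/self/fd); on other systems `ulimit -n` and `lsof -p <pid>` give the same picture:

```java
import java.io.File;

// Linux-only diagnostic sketch: count this JVM's open file descriptors by
// listing /proc/self/fd. Comparing the count before and after job starts and
// cancellations helps distinguish a genuine handle leak from a limit that is
// merely set too low.
public class OpenFdCount {

    static int openFdCount() {
        File[] fds = new File("/proc/self/fd").listFiles();
        // listFiles() returns null when /proc is unavailable (non-Linux);
        // report -1 in that case rather than throwing.
        return fds == null ? -1 : fds.length;
    }

    public static void main(String[] args) {
        System.out.println("open file descriptors: " + openFdCount());
    }
}
```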
Re: [ANNOUNCE] Weekly community update #12
Great initiative, highly appreciated, Till!

On Mon, Mar 19, 2018 at 7:06 PM, Till Rohrmann wrote:
> Dear community,
>
> I've noticed that Flink has grown quite a bit in the past. As a
> consequence, it can be quite challenging to stay up to date, especially
> for community members who don't follow Flink's mailing lists on a daily
> basis.
>
> In order to keep a bigger part of the community in the loop, I wanted to
> try out a weekly update letter in which I update the community on what
> happened from my perspective. Since I don't know everything either, I
> want to encourage others to post updates about things they deem important
> and relevant for the community to this thread.
>
> # Weekly update #12
>
> ## Flink 1.5 release
> - The Flink community is still working on the Flink 1.5 release.
>   Hopefully Flink 1.5 can be released in the next few weeks.
> - Last week's work concentrated on stabilizing Flip-6 and adding more
>   automated tests [1]. The Flink community appreciates every helping hand
>   with adding more end-to-end tests.
> - Consequently, the committed changes mainly consisted of bug fixes and
>   test hardening.
> - By the end of this week, we hope to have an RC ready which can be used
>   for easier release testing. Given the big changes (network stack and
>   Flip-6), the RC will most likely still contain some rough edges. In
>   order to smooth them out, it would be good if we ran Flink 1.5 in as
>   many different scenarios as possible.
>
> ## Flink 1.3.3 has been released
> - Flink 1.3.3, containing an important fix for properly handling
>   checkpoints in case of a DFS problem, has been released. We highly
>   recommend that all users running Flink 1.3.2 upgrade swiftly to
>   Flink 1.3.3.
>
> ## Misc
> - Shuyi opened a discussion about improving Flink's security [2]. If you
>   are interested and want to help with the next steps, please engage in
>   the discussion.
>
> PS: Don't worry that you've missed the first 11 weekly community updates.
> It's just this week's number.
>
> [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ANNOUNCE-Flink-1-5-release-testing-effort-td21646.html
> [2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-security-improvements-td21068.html
>
> Cheers,
> Till
Re: FLIP-6 / Kubernetes
@Eron That is definitely what we want to suggest as the way to use K8s in
the future. This feature did not make it into 1.5, but should come very
soon after.

@Thomas An implementation of a ResourceManager for K8s should come in the
near future. I would be happy to jump on a joint FLIP after the 1.5 release
and Flink Forward (in three weeks or so).

On Tue, Mar 20, 2018 at 4:25 PM, Eron Wright wrote:
> Till, is it possible to package a Flink application as a self-contained
> deployment on Kubernetes? I mean, run a Flink application using 'flink
> run' such that it launches its own RM/JM and waits for a sufficient # of
> TMs to join?
>
> Thanks!
>
> On Mon, Mar 19, 2018 at 2:57 AM, Till Rohrmann wrote:
> > I forgot to add Flink's K8s documentation [1], which might also be
> > helpful for getting started.
> >
> > [1] https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/kubernetes.html
> >
> > Cheers,
> > Till
> >
> > On Mon, Mar 19, 2018 at 10:54 AM, Till Rohrmann wrote:
> > > Hi Thomas,
> > >
> > > I think one way to get started would be to adapt the Flink Docker
> > > images [1,2] to run with Flink 1.5. By default, they will use the
> > > Flip-6 components.
> > >
> > > Flink 1.5 won't come with a dedicated Kubernetes integration that is
> > > able to start new pods. However, it should work that you start new
> > > pods manually, or by virtue of an external system. Once the new pods
> > > have been started and the TaskExecutors have registered, one would
> > > have to rescale the job manually to the new parallelism.
> > >
> > > [1] https://flink.apache.org/news/2017/05/16/official-docker-image.html
> > > [2] https://hub.docker.com/r/_/flink/
> > >
> > > Cheers,
> > > Till
> > >
> > > On Sun, Mar 18, 2018 at 9:05 PM, Thomas Weise wrote:
> > > > Hi,
> > > >
> > > > What would be a good starting point to try out Flink on Kubernetes
> > > > (any examples / tutorials)?
> > > >
> > > > Also, will the FLIP-6 work in 1.5 enable dynamic scaling on
> > > > Kubernetes?
> > > >
> > > > Thanks,
> > > > Thomas
[jira] [Created] (FLINK-9036) Add default value via suppliers
Stephan Ewen created FLINK-9036:
-----------------------------------

             Summary: Add default value via suppliers
                 Key: FLINK-9036
                 URL: https://issues.apache.org/jira/browse/FLINK-9036
             Project: Flink
          Issue Type: Improvement
          Components: State Backends, Checkpointing
            Reporter: Stephan Ewen
            Assignee: Stephan Ewen
             Fix For: 1.6.0


Earlier versions had a default value in {{ValueState}}. We dropped that,
because the value would have to be duplicated on each access, to be safe
against side effects when using mutable types.

For convenience, we should re-add the feature, but using a supplier/factory
function to create the default value on access.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
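The ticket does not fix the API shape, but the motivation can be sketched in plain Java (the `ValueCell` class below is a hypothetical stand-in, not Flink's `ValueState`): a supplier produces a fresh default on every access, so mutating the returned default cannot leak into later reads the way a single shared default instance could.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of a supplier-based default: each access without a
// stored value gets a *fresh* instance from the factory, which is why this
// is safe for mutable types where one shared default object would not be.
public class SupplierDefaultSketch {

    static final class ValueCell<T> {
        private final Supplier<T> defaultSupplier;
        private T value; // null means "no value set yet"

        ValueCell(Supplier<T> defaultSupplier) {
            this.defaultSupplier = defaultSupplier;
        }

        T get() {
            // Create the default lazily on each access instead of
            // handing out one shared instance.
            return value != null ? value : defaultSupplier.get();
        }

        void update(T newValue) {
            this.value = newValue;
        }
    }

    public static void main(String[] args) {
        ValueCell<List<String>> cell = new ValueCell<>(ArrayList::new);

        // Mutating the returned default does not affect later accesses:
        cell.get().add("side effect");
        System.out.println(cell.get().size()); // fresh default each time

        cell.update(new ArrayList<>(List.of("a")));
        System.out.println(cell.get().size());
    }
}
```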
[jira] [Created] (FLINK-9035) State Descriptors have broken hashCode() and equals()
Stephan Ewen created FLINK-9035:
-----------------------------------

             Summary: State Descriptors have broken hashCode() and equals()
                 Key: FLINK-9035
                 URL: https://issues.apache.org/jira/browse/FLINK-9035
             Project: Flink
          Issue Type: Bug
          Components: State Backends, Checkpointing
    Affects Versions: 1.4.2, 1.5.0
            Reporter: Stephan Ewen
            Assignee: Stephan Ewen
             Fix For: 1.6.0


The following code fails with a {{NullPointerException}}:

{code}
ValueStateDescriptor<String> descr = new ValueStateDescriptor<>("name", String.class);
descr.hashCode();
{code}

The {{hashCode()}} function tries to access the {{serializer}} field, which
may be uninitialized at that point.

The {{equals()}} method is equally broken (no pun intended):

{code}
ValueStateDescriptor<String> a = new ValueStateDescriptor<>("name", String.class);
ValueStateDescriptor<String> b = new ValueStateDescriptor<>("name", String.class);

a.equals(b) // exception
b.equals(a) // exception

a.initializeSerializerUnlessSet(new ExecutionConfig());

a.equals(b) // false
b.equals(a) // exception

b.initializeSerializerUnlessSet(new ExecutionConfig());

a.equals(b) // true
b.equals(a) // true
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
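One possible shape of a fix, sketched without Flink dependencies (this is an illustration of null-safe `hashCode()`/`equals()` over a lazily initialized field, not the actual Flink patch): hash only the always-present fields, and compare the lazy field with `Objects.equals`, which restores the symmetry that the bug report shows is violated.

```java
import java.util.Objects;

// Hedged sketch (not the actual Flink fix): null-safe hashCode()/equals()
// for a descriptor whose 'serializer' field is initialized lazily.
public class NullSafeDescriptor {
    private final String name;
    private Object serializer; // may still be null before initialization

    public NullSafeDescriptor(String name) {
        this.name = name;
    }

    public void initializeSerializer(Object serializer) {
        this.serializer = serializer;
    }

    @Override
    public int hashCode() {
        // Hash only fields that are always set, so an uninitialized
        // descriptor no longer throws a NullPointerException.
        return Objects.hash(getClass(), name);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        NullSafeDescriptor that = (NullSafeDescriptor) o;
        // Objects.equals tolerates null on either side, making
        // a.equals(b) and b.equals(a) agree in every initialization state.
        return name.equals(that.name) && Objects.equals(serializer, that.serializer);
    }

    public static void main(String[] args) {
        NullSafeDescriptor a = new NullSafeDescriptor("name");
        NullSafeDescriptor b = new NullSafeDescriptor("name");
        System.out.println(a.hashCode() == b.hashCode()); // no exception
        System.out.println(a.equals(b) && b.equals(a));   // symmetric
    }
}
```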
[jira] [Created] (FLINK-9034) State Descriptors drop TypeInformation on serialization
Stephan Ewen created FLINK-9034:
-----------------------------------

             Summary: State Descriptors drop TypeInformation on serialization
                 Key: FLINK-9034
                 URL: https://issues.apache.org/jira/browse/FLINK-9034
             Project: Flink
          Issue Type: Bug
          Components: State Backends, Checkpointing
    Affects Versions: 1.4.2, 1.5.0
            Reporter: Stephan Ewen
             Fix For: 1.6.0


The following code currently causes problems:

{code}
public class MyFunction extends RichMapFunction<MyType, MyType> {

    private final ValueStateDescriptor<MyType> descr =
            new ValueStateDescriptor<>("state name", MyType.class);

    private ValueState<MyType> state;

    @Override
    public void open() {
        state = getRuntimeContext().getValueState(descr);
    }
}
{code}

The problem is that the state descriptor drops the type information and
creates a serializer before serialization, as part of shipping the function
to the cluster. To do that, it initializes the serializer with an empty
execution config, making serialization inconsistent.

This is mainly an artifact from the days when dropping the type information
before shipping was necessary, because the type info was not serializable.
It now is, and we can fix that bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
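The fix direction described above can be illustrated schematically in plain Java (these are stand-in classes, not Flink's actual `TypeInformation`/`TypeSerializer`): keep the now-serializable type description across Java serialization, and derive the serializer lazily on the receiving side with the config that is actually in effect there, rather than eagerly with an empty one.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

// Schematic, non-Flink sketch: the descriptor ships its serializable type
// description and builds the "serializer" lazily from the receiver's config,
// instead of baking in one built from a default/empty config before shipping.
public class LazyDescriptorSketch {

    static final class Descriptor implements Serializable {
        private final String typeName;                          // stands in for TypeInformation
        private transient Function<String, String> serializer;  // built lazily, never shipped

        Descriptor(String typeName) {
            this.typeName = typeName;
        }

        Function<String, String> serializerFor(String config) {
            if (serializer == null) {
                // Derived from the config available where it is actually used.
                final String cfg = config;
                serializer = s -> "[" + cfg + ":" + typeName + "] " + s;
            }
            return serializer;
        }
    }

    // Java-serialization round trip, standing in for shipping the function.
    static Descriptor roundTrip(Descriptor d) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(d);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            return (Descriptor) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Descriptor shipped = roundTrip(new Descriptor("MyType"));
        // The type description survived shipping; the serializer is created
        // only now, with the cluster-side config.
        System.out.println(shipped.serializerFor("clusterConfig").apply("hello"));
    }
}
```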
[jira] [Created] (FLINK-9033) Replace usages of deprecated TASK_MANAGER_NUM_TASK_SLOTS
Hai Zhou created FLINK-9033:
-------------------------------

             Summary: Replace usages of deprecated TASK_MANAGER_NUM_TASK_SLOTS
                 Key: FLINK-9033
                 URL: https://issues.apache.org/jira/browse/FLINK-9033
             Project: Flink
          Issue Type: Improvement
          Components: Configuration
    Affects Versions: 1.5.0
            Reporter: Hai Zhou
            Assignee: Hai Zhou
             Fix For: 1.5.0


The deprecated {{ConfigConstants#TASK_MANAGER_NUM_TASK_SLOTS}} is still used
a lot. We should replace these usages with {{TaskManagerOptions#NUM_TASK_SLOTS}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
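The pattern behind this migration can be sketched without Flink dependencies (the `IntOption` holder below is a stand-in for Flink's `ConfigOption`, and the key/default are assumptions for illustration): a typed option object carries its key and default together, so call sites stop hard-coding the bare string constant.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in sketch of the migration: replace a bare deprecated string constant
// with a typed option that bundles key + default value, mirroring the move
// from ConfigConstants#TASK_MANAGER_NUM_TASK_SLOTS to
// TaskManagerOptions#NUM_TASK_SLOTS.
public class TypedOptionSketch {

    /** @deprecated bare key with no attached default; use {@link #NUM_TASK_SLOTS}. */
    @Deprecated
    public static final String TASK_MANAGER_NUM_TASK_SLOTS = "taskmanager.numberOfTaskSlots";

    static final class IntOption {
        final String key;
        final int defaultValue;

        IntOption(String key, int defaultValue) {
            this.key = key;
            this.defaultValue = defaultValue;
        }
    }

    // Assumed key and default for illustration.
    public static final IntOption NUM_TASK_SLOTS =
            new IntOption("taskmanager.numberOfTaskSlots", 1);

    static int getInteger(Map<String, String> config, IntOption option) {
        String raw = config.get(option.key);
        // The default travels with the option, not with each call site.
        return raw == null ? option.defaultValue : Integer.parseInt(raw);
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        System.out.println(getInteger(config, NUM_TASK_SLOTS)); // falls back to default

        config.put(NUM_TASK_SLOTS.key, "4");
        System.out.println(getInteger(config, NUM_TASK_SLOTS)); // configured value
    }
}
```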
Re: FLIP-6 / Kubernetes
Till, is it possible to package a Flink application as a self-contained
deployment on Kubernetes? I mean, run a Flink application using 'flink run'
such that it launches its own RM/JM and waits for a sufficient # of TMs to
join?

Thanks!

On Mon, Mar 19, 2018 at 2:57 AM, Till Rohrmann wrote:
> I forgot to add Flink's K8s documentation [1], which might also be
> helpful for getting started.
>
> [1] https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/kubernetes.html
>
> Cheers,
> Till
>
> On Mon, Mar 19, 2018 at 10:54 AM, Till Rohrmann wrote:
> > Hi Thomas,
> >
> > I think one way to get started would be to adapt the Flink Docker
> > images [1,2] to run with Flink 1.5. By default, they will use the
> > Flip-6 components.
> >
> > Flink 1.5 won't come with a dedicated Kubernetes integration that is
> > able to start new pods. However, it should work that you start new
> > pods manually, or by virtue of an external system. Once the new pods
> > have been started and the TaskExecutors have registered, one would
> > have to rescale the job manually to the new parallelism.
> >
> > [1] https://flink.apache.org/news/2017/05/16/official-docker-image.html
> > [2] https://hub.docker.com/r/_/flink/
> >
> > Cheers,
> > Till
> >
> > On Sun, Mar 18, 2018 at 9:05 PM, Thomas Weise wrote:
> > > Hi,
> > >
> > > What would be a good starting point to try out Flink on Kubernetes
> > > (any examples / tutorials)?
> > >
> > > Also, will the FLIP-6 work in 1.5 enable dynamic scaling on
> > > Kubernetes?
> > >
> > > Thanks,
> > > Thomas
[jira] [Created] (FLINK-9032) Deprecate ConfigConstants#YARN_CONTAINER_START_COMMAND_TEMPLATE
Hai Zhou created FLINK-9032:
-------------------------------

             Summary: Deprecate ConfigConstants#YARN_CONTAINER_START_COMMAND_TEMPLATE
                 Key: FLINK-9032
                 URL: https://issues.apache.org/jira/browse/FLINK-9032
             Project: Flink
          Issue Type: Improvement
          Components: Documentation
    Affects Versions: 1.5.0
            Reporter: Hai Zhou
            Assignee: Hai Zhou
             Fix For: 1.5.0, 1.6.0


This constant ("yarn.container-start-command-template") has disappeared from
the [1.5.0-SNAPSHOT docs|https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html].
We should restore it, and I think it should be renamed
"containerized.start-command-template".

[~Zentol], what do you think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (FLINK-9030) JobManager fails to archive job to FS when TM is lost
Jared Stehler created FLINK-9030:
------------------------------------

             Summary: JobManager fails to archive job to FS when TM is lost
                 Key: FLINK-9030
                 URL: https://issues.apache.org/jira/browse/FLINK-9030
             Project: Flink
          Issue Type: Bug
          Components: History Server, JobManager, Mesos
    Affects Versions: 1.4.0
            Reporter: Jared Stehler


We are running Flink on Mesos, and are finding that when a job fails because
a task manager is lost (from an OOM kill), the job isn't archived properly
into the history server directory on the filesystem.

When this happens, the job does appear in the finished listing in the job
manager's in-memory archive view, and is accessible through the running job
manager's REST API, but obviously not through the history server's REST API.

This is causing us issues, as we use the history server as a system of
record for cancelled or failed jobs in order to determine previous
savepoints / externalized checkpoints.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (FLINK-9029) Getting write permission from HDFS after updating flink-1.40 to flink-1.4.2
Mohammad Abareghi created FLINK-9029:
----------------------------------------

             Summary: Getting write permission from HDFS after updating flink-1.4.0 to flink-1.4.2
                 Key: FLINK-9029
                 URL: https://issues.apache.org/jira/browse/FLINK-9029
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.4.2, 1.4.1
         Environment: * Flink-1.4.2 (Flink-1.4.1)
* Hadoop 2.6.0-cdh5.13.0 with 4 nodes in service and {{Security is off.}}
* Ubuntu 16.04.3 LTS
* Java 8
            Reporter: Mohammad Abareghi


*Description*

I have a Java job in flink-1.4.0 which writes to HDFS at a specific path.
After updating to flink-1.4.2, I'm getting the following error from Hadoop,
complaining that the user doesn't have write permission to the given path:

{code:java}
WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:xng (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: Permission denied: user=user1, access=WRITE, inode="/user":hdfs:hadoop:drwxr-xr-x
{code}

*NOTE*:
* If I run the same job on flink-1.4.0, the error disappears regardless of
which version of the flink dependencies (1.4.0 or 1.4.2) the job uses.
* Also, if I run the job's main method from my IDE and pass the same
parameters, I don't get the above error.

*NOTE*: It seems the problem is somehow in
{{flink-1.4.2/lib/flink-shaded-hadoop2-uber-1.4.2.jar}}. If I replace that
with {{flink-1.4.0/lib/flink-shaded-hadoop2-uber-1.4.0.jar}}, restart the
cluster, and run my job (flink topology), the error doesn't appear.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
Re: [ANNOUNCE] Apache Flink 1.3.3 released
Whoops, looks like we forgot to push the release button :) Thank you for
notifying us. The artifacts should be available soon.

On 20.03.2018 11:35, Philip Luppens wrote:
> Hi everyone,
>
> Thanks, but I don’t see the binaries for 1.3.3 being pushed anywhere in
> the Maven repositories [1]. Can we expect them to show up over there as
> well eventually?
>
> [1] https://repo.maven.apache.org/maven2/org/apache/flink/flink-java/
>
> Kind regards,
> -Phil
>
> On Fri, Mar 16, 2018 at 3:36 PM, Stephan Ewen wrote:
> > This release fixed a quite critical bug that could lead to loss of
> > checkpoint state: https://issues.apache.org/jira/browse/FLINK-7783
> >
> > We recommend all users on Flink 1.3.2 to upgrade to 1.3.3.
> >
> > On Fri, Mar 16, 2018 at 10:31 AM, Till Rohrmann wrote:
> > > Thanks for managing the release Gordon and also thanks to the
> > > community!
> > >
> > > Cheers,
> > > Till
> > >
> > > On Fri, Mar 16, 2018 at 9:05 AM, Fabian Hueske wrote:
> > > > Thanks for managing this release Gordon!
> > > >
> > > > Cheers, Fabian
> > > >
> > > > 2018-03-15 21:24 GMT+01:00 Tzu-Li (Gordon) Tai:
> > > > > The Apache Flink community is very happy to announce the release
> > > > > of Apache Flink 1.3.3, which is the third bugfix release for the
> > > > > Apache Flink 1.3 series.
> > > > >
> > > > > Apache Flink® is an open-source stream processing framework for
> > > > > distributed, high-performing, always-available, and accurate data
> > > > > streaming applications.
> > > > >
> > > > > The release is available for download at:
> > > > > https://flink.apache.org/downloads.html
> > > > >
> > > > > Please check out the release blog post for an overview of the
> > > > > improvements for this bugfix release:
> > > > > https://flink.apache.org/news/2018/03/15/release-1.3.3.html
> > > > >
> > > > > The full release notes are available in Jira:
> > > > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12341142
> > > > >
> > > > > We would like to thank all contributors of the Apache Flink
> > > > > community who made this release possible!
> > > > >
> > > > > Cheers,
> > > > > Gordon
Re: [ANNOUNCE] Apache Flink 1.3.3 released
Hi everyone,

Thanks, but I don’t see the binaries for 1.3.3 being pushed anywhere in the
Maven repositories [1]. Can we expect them to show up over there as well
eventually?

[1] https://repo.maven.apache.org/maven2/org/apache/flink/flink-java/

Kind regards,
-Phil

On Fri, Mar 16, 2018 at 3:36 PM, Stephan Ewen wrote:
> This release fixed a quite critical bug that could lead to loss of
> checkpoint state: https://issues.apache.org/jira/browse/FLINK-7783
>
> We recommend all users on Flink 1.3.2 to upgrade to 1.3.3.
>
> On Fri, Mar 16, 2018 at 10:31 AM, Till Rohrmann wrote:
> > Thanks for managing the release Gordon and also thanks to the community!
> >
> > Cheers,
> > Till
> >
> > On Fri, Mar 16, 2018 at 9:05 AM, Fabian Hueske wrote:
> > > Thanks for managing this release Gordon!
> > >
> > > Cheers, Fabian
> > >
> > > 2018-03-15 21:24 GMT+01:00 Tzu-Li (Gordon) Tai:
> > > > The Apache Flink community is very happy to announce the release of
> > > > Apache Flink 1.3.3, which is the third bugfix release for the
> > > > Apache Flink 1.3 series.
> > > >
> > > > Apache Flink® is an open-source stream processing framework for
> > > > distributed, high-performing, always-available, and accurate data
> > > > streaming applications.
> > > >
> > > > The release is available for download at:
> > > > https://flink.apache.org/downloads.html
> > > >
> > > > Please check out the release blog post for an overview of the
> > > > improvements for this bugfix release:
> > > > https://flink.apache.org/news/2018/03/15/release-1.3.3.html
> > > >
> > > > The full release notes are available in Jira:
> > > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12341142
> > > >
> > > > We would like to thank all contributors of the Apache Flink
> > > > community who made this release possible!
> > > >
> > > > Cheers,
> > > > Gordon

--
"We cannot change the cards we are dealt, just how we play the hand." -
Randy Pausch
[jira] [Created] (FLINK-9028) flip6 should check config before starting cluster
Sihua Zhou created FLINK-9028:
---------------------------------

             Summary: flip6 should check config before starting cluster
                 Key: FLINK-9028
                 URL: https://issues.apache.org/jira/browse/FLINK-9028
             Project: Flink
          Issue Type: Bug
          Components: Distributed Coordination
    Affects Versions: 1.5.0
            Reporter: Sihua Zhou
            Assignee: Sihua Zhou
             Fix For: 1.5.0


In Flip-6, we should validate the configuration parameters before starting
the cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
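A minimal sketch of the kind of up-front check meant here (hypothetical keys and helper names, not Flink's actual implementation): collect every problem first and fail fast, before any cluster component starts, instead of blowing up mid-startup on the first bad value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: validate configuration values up front and report all
// problems at once, so a misconfigured cluster never begins starting up.
public class ConfigValidationSketch {

    static List<String> validate(Map<String, String> config) {
        List<String> errors = new ArrayList<>();
        requirePositiveInt(config, "taskmanager.numberOfTaskSlots", errors);
        requirePositiveInt(config, "parallelism.default", errors);
        return errors;
    }

    private static void requirePositiveInt(
            Map<String, String> config, String key, List<String> errors) {
        String raw = config.get(key);
        if (raw == null) {
            return; // unset keys fall back to defaults elsewhere
        }
        try {
            if (Integer.parseInt(raw.trim()) <= 0) {
                errors.add(key + " must be positive, got: " + raw);
            }
        } catch (NumberFormatException e) {
            errors.add(key + " must be an integer, got: " + raw);
        }
    }

    public static void main(String[] args) {
        Map<String, String> config = Map.of(
                "taskmanager.numberOfTaskSlots", "0",
                "parallelism.default", "oops");
        // Every problem is listed before anything is started.
        validate(config).forEach(System.out::println);
    }
}
```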