Re: Too many open files

2018-03-20 Thread Ted Yu
Not sure if you have looked at FLINK-8707

FYI

On Tue, Mar 20, 2018 at 2:13 PM, Govindarajan Srinivasaraghavan <
govindragh...@gmail.com> wrote:

> Hi,
>
> We have a streaming job that runs on flink in docker and checkpointing
> happens every 10 seconds. After several starts and cancellations we are
> facing this issue with file handles.
>
> The job reads data from kafka, processes it and writes it back to kafka and
> we are using RocksDB state backend. For now we have increased the number of
> file handles to resolve the problem but would like to know if this is
> expected or is it an issue. Thanks.
>
> java.io.FileNotFoundException:
> /tmp/flink-io-b3043cd6-50c8-446a-8c25-fade1b1862c0/cb317fc2578db72b3046468948fa00f2f17039b6104e72fb8c58938e5869cfbc.0.buffer
> (Too many open files)
>
> at java.io.RandomAccessFile.open0(Native Method)
> at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
> at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
> at org.apache.flink.streaming.runtime.io.BufferSpiller.createSpillingChannel(BufferSpiller.java:259)
> at org.apache.flink.streaming.runtime.io.BufferSpiller.<init>(BufferSpiller.java:120)
> at org.apache.flink.streaming.runtime.io.BarrierBuffer.<init>(BarrierBuffer.java:149)
> at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.<init>(StreamTwoInputProcessor.java:147)
> at org.apache.flink.streaming.runtime.tasks.TwoInputStreamTask.init(TwoInputStreamTask.java:79)
> at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:235)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
> at java.lang.Thread.run(Thread.java:748)
>
> Regards,
> Govind
>


Too many open files

2018-03-20 Thread Govindarajan Srinivasaraghavan
Hi,

We have a streaming job that runs on flink in docker and checkpointing
happens every 10 seconds. After several starts and cancellations we are
facing this issue with file handles.

The job reads data from kafka, processes it and writes it back to kafka and
we are using RocksDB state backend. For now we have increased the number of
file handles to resolve the problem but would like to know if this is
expected or is it an issue. Thanks.

java.io.FileNotFoundException:
/tmp/flink-io-b3043cd6-50c8-446a-8c25-fade1b1862c0/cb317fc2578db72b3046468948fa00f2f17039b6104e72fb8c58938e5869cfbc.0.buffer
(Too many open files)

at java.io.RandomAccessFile.open0(Native Method)
at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
at org.apache.flink.streaming.runtime.io.BufferSpiller.createSpillingChannel(BufferSpiller.java:259)
at org.apache.flink.streaming.runtime.io.BufferSpiller.<init>(BufferSpiller.java:120)
at org.apache.flink.streaming.runtime.io.BarrierBuffer.<init>(BarrierBuffer.java:149)
at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.<init>(StreamTwoInputProcessor.java:147)
at org.apache.flink.streaming.runtime.tasks.TwoInputStreamTask.init(TwoInputStreamTask.java:79)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:235)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
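[Editorial note: as a quick diagnostic, the JVM's own descriptor count can be polled while the job starts and cancels. This is a hedged sketch, not part of Flink: the helper name is made up, and it is Linux-only since it relies on the /proc filesystem.]

```java
import java.io.File;

public class FdCount {
    // Counts the file descriptors currently held by this JVM process
    // by listing /proc/self/fd. Returns -1 where /proc is unavailable
    // (e.g. macOS, Windows).
    static int openFileDescriptors() {
        String[] fds = new File("/proc/self/fd").list();
        return fds == null ? -1 : fds.length;
    }

    public static void main(String[] args) {
        System.out.println("open fds: " + openFileDescriptors());
    }
}
```

Comparing this count against `ulimit -n` after each start/cancel cycle would show whether handles actually leak or the limit is simply too low for RocksDB plus spill buffers.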

Regards,
Govind


Re: [ANNOUNCE] Weekly community update #12

2018-03-20 Thread Stephan Ewen
Great initiative, highly appreciated, Till!


On Mon, Mar 19, 2018 at 7:06 PM, Till Rohrmann  wrote:

> Dear community,
>
> I've noticed that Flink has grown quite a bit in the past. As a
> consequence, it can be quite challenging to stay up to date, especially for
> community members who don't follow Flink's MLs on a daily basis.
>
> In order to keep a bigger part of the community in the loop, I wanted to
> try out a weekly update letter where I update the community with what
> happened from my perspective. Since I also don't know everything I want to
> encourage others to post updates about things they deem important and
> relevant for the community to this thread.
>
> # Weekly update #12:
>
> ## Flink 1.5 release:
> - The Flink community is still working on the Flink 1.5 release. Hopefully
> Flink 1.5 can be released in the coming weeks.
> - The main work concentrated last week on stabilizing Flip-6 and adding
> more automated tests [1]. The Flink community appreciates every helping
> hand with adding more end to end tests.
> - Consequently, the committed changes mainly consisted of bug fixes and
> test hardening.
> - By the end of this week, we hope to have an RC ready which can be used
> for easier release testing. Given the big changes (network stack and
> Flip-6), the RC will most likely still contain some rough edges. In order
> to smooth them out, it would be good if we run Flink 1.5 in as many
> different scenarios as possible.
>
> ## Flink 1.3.3 has been released
> - Flink 1.3.3 containing an important fix for properly handling
> checkpoints in case of a DFS problem has been released. We highly recommend
> that all users running Flink 1.3.2 upgrade swiftly to Flink 1.3.3.
>
> ## Misc:
> - Shuyi opened a discussion about improving Flink's security [2]. If you
> are interested and want to help with the next steps please engage in the
> discussion.
>
> PS: Don't worry that you've missed the first 11 weekly community updates.
> It's just this week's number.
>
> [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ANNOUNCE-Flink-1-5-release-testing-effort-td21646.html
> [2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-security-improvements-td21068.html
>
> Cheers,
> Till
>


Re: FLIP-6 / Kubernetes

2018-03-20 Thread Stephan Ewen
@Eron That is definitely the approach we want to suggest for using k8s
in the future. This feature did not make it into 1.5, but should come very
soon after.

@Thomas An implementation of a ResourceManager for k8s should come in the
near future. Would be happy to jump on a joint FLIP, after the 1.5 release
and Flink Forward (in three weeks or so).

On Tue, Mar 20, 2018 at 4:25 PM, Eron Wright  wrote:

>  Till, is it possible to package a Flink application as a self-contained
> deployment on Kubernetes?  I mean, run a Flink application using 'flink
> run' such that it launches its own RM/JM and waits for a sufficient # of
> TMs to join?
>
> Thanks!
>
> On Mon, Mar 19, 2018 at 2:57 AM, Till Rohrmann 
> wrote:
>
> > I forgot to add Flink's K8 documentation [1] which might also be helpful
> > with getting started.
> >
> > [1]
> > https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/kubernetes.html
> >
> > Cheers,
> > Till
> >
> > On Mon, Mar 19, 2018 at 10:54 AM, Till Rohrmann 
> > wrote:
> >
> > > Hi Thomas,
> > >
> > > I think one way to get started would be to adapt the Flink Docker
> > > images [1,2] to run with Flink 1.5. By default, they will use the
> > > Flip-6 components.
> > >
> > > Flink 1.5 won't come with a dedicated integration with Kubernetes which is
> > > able to start new pods. However, it should work that you manually, or by
> > > virtue of an external system, start new pods which can be used. Once the
> > > new pods have been started and the TaskExecutors have registered, one
> > > would have to rescale the job manually to the new parallelism.
> > >
> > > [1] https://flink.apache.org/news/2017/05/16/official-docker-image.html
> > > [2] https://hub.docker.com/r/_/flink/
> > >
> > > Cheers,
> > > Till
> > >
> > > On Sun, Mar 18, 2018 at 9:05 PM, Thomas Weise  wrote:
> > >
> > >> Hi,
> > >>
> > >> What would be a good starting point to try out Flink on Kubernetes
> (any
> > >> examples / tutorials)?
> > >>
> > >> Also, will the FLIP-6 work in 1.5 enable dynamic scaling on
> Kubernetes?
> > >>
> > >> Thanks,
> > >> Thomas
> > >>
> > >
> > >
> >
>


[jira] [Created] (FLINK-9036) Add default value via suppliers

2018-03-20 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-9036:
---

 Summary: Add default value via suppliers
 Key: FLINK-9036
 URL: https://issues.apache.org/jira/browse/FLINK-9036
 Project: Flink
  Issue Type: Improvement
  Components: State Backends, Checkpointing
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 1.6.0


Earlier versions had a default value in {{ValueState}}. We dropped that, 
because the value would have to be duplicated on each access, to be safe 
against side effects when using mutable types.

For convenience, we should re-add the feature, but using a supplier/factory 
function to create the default value on access.
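The idea can be sketched in a few lines of plain Java (class and method names here are made up, not the proposed Flink API): because each access without a stored value asks the supplier for a fresh instance, mutable defaults are never shared between accesses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class DefaultViaSupplier<T> {
    private final Supplier<T> defaultSupplier;
    private T value; // null means "not set"

    DefaultViaSupplier(Supplier<T> defaultSupplier) {
        this.defaultSupplier = defaultSupplier;
    }

    // Each access with no stored value gets a fresh default instance,
    // so callers may mutate it without affecting later accesses.
    T get() {
        return value != null ? value : defaultSupplier.get();
    }

    void set(T v) { value = v; }

    public static void main(String[] args) {
        DefaultViaSupplier<List<String>> state =
                new DefaultViaSupplier<>(ArrayList::new);
        List<String> a = state.get();
        a.add("side effect");            // mutating one default...
        System.out.println(state.get()); // ...does not leak into the next
    }
}
```

This trades one extra allocation per defaulted access for safety against shared mutable state, which matches the reasoning in the issue.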



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9035) State Descriptors have broken hashCode() and equals()

2018-03-20 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-9035:
---

 Summary: State Descriptors have broken hashCode() and equals()
 Key: FLINK-9035
 URL: https://issues.apache.org/jira/browse/FLINK-9035
 Project: Flink
  Issue Type: Bug
  Components: State Backends, Checkpointing
Affects Versions: 1.4.2, 1.5.0
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 1.6.0


The following code fails with a {{NullPointerException}}:
{code}
ValueStateDescriptor<String> descr = new ValueStateDescriptor<>("name", String.class);
descr.hashCode();
{code}
The {{hashCode()}} function tries to access the {{serializer}} field, which may 
be uninitialized at that point.

The {{equals()}} method is equally broken (no pun intended):

{code}
ValueStateDescriptor<String> a = new ValueStateDescriptor<>("name", String.class);
ValueStateDescriptor<String> b = new ValueStateDescriptor<>("name", String.class);

a.equals(b) // exception
b.equals(a) // exception

a.initializeSerializerUnlessSet(new ExecutionConfig());

a.equals(b) // false
b.equals(a) // exception

b.initializeSerializerUnlessSet(new ExecutionConfig());

a.equals(b) // true
b.equals(a) // true
{code}
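One null-safe shape for such methods (a sketch under the assumption that the lazily-initialized field is the only problem; this is not the actual patch) is to route the nullable field through java.util.Objects:

```java
import java.util.Objects;

public class Descriptor {
    private final String name;
    private String serializer; // lazily initialized, may still be null

    Descriptor(String name) { this.name = name; }

    void initialize() { serializer = "serializer-for-" + name; }

    @Override
    public int hashCode() {
        // Objects.hash tolerates nulls, so an uninitialized serializer
        // no longer causes a NullPointerException.
        return Objects.hash(name, serializer);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Descriptor)) return false;
        Descriptor that = (Descriptor) o;
        // Objects.equals is null-safe, keeping equals() symmetric
        // regardless of which side has been initialized.
        return name.equals(that.name)
                && Objects.equals(serializer, that.serializer);
    }

    public static void main(String[] args) {
        Descriptor a = new Descriptor("name");
        Descriptor b = new Descriptor("name");
        System.out.println(a.hashCode() == b.hashCode()); // true, and no NPE
        System.out.println(a.equals(b) && b.equals(a));   // true
    }
}
```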





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9034) State Descriptors drop TypeInformation on serialization

2018-03-20 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-9034:
---

 Summary: State Descriptors drop TypeInformation on serialization
 Key: FLINK-9034
 URL: https://issues.apache.org/jira/browse/FLINK-9034
 Project: Flink
  Issue Type: Bug
  Components: State Backends, Checkpointing
Affects Versions: 1.4.2, 1.5.0
Reporter: Stephan Ewen
 Fix For: 1.6.0


The following code currently causes problems

{code}
public class MyFunction extends RichMapFunction<MyType, MyType> {

    private final ValueStateDescriptor<MyType> descr =
            new ValueStateDescriptor<>("state name", MyType.class);

    private ValueState<MyType> state;

    @Override
    public void open(Configuration parameters) {
        state = getRuntimeContext().getState(descr);
    }
}
{code}

The problem is that the state descriptor drops the type information and creates 
a serializer before serialization, as part of shipping the function to the 
cluster. To do that, it initializes the serializer with an empty execution 
config, making serialization inconsistent.

This is mainly an artifact from the days when dropping the type information 
before shipping was necessary, because the type info was not serializable. It 
now is, and we can fix that bug.
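The underlying mechanism is easy to reproduce in plain Java (a generic sketch, not Flink's actual classes): a transient field is silently dropped when the object is serialized for shipping, so anything derived from it on the client is rebuilt from defaults on the other side.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientDrop {
    static class Descriptor implements Serializable {
        final String name;
        transient String typeInfo; // dropped when the object is shipped

        Descriptor(String name, String typeInfo) {
            this.name = name;
            this.typeInfo = typeInfo;
        }
    }

    // Serialize and deserialize, mimicking the shipping of a function
    // (and its descriptor field) to the cluster.
    static Descriptor roundTrip(Descriptor d) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(d);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return (Descriptor) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Descriptor shipped = roundTrip(new Descriptor("state name", "MyType"));
        System.out.println(shipped.typeInfo); // prints: null
    }
}
```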






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9033) Replace usages of deprecated TASK_MANAGER_NUM_TASK_SLOTS

2018-03-20 Thread Hai Zhou (JIRA)
Hai Zhou created FLINK-9033:
---

 Summary: Replace usages of deprecated TASK_MANAGER_NUM_TASK_SLOTS
 Key: FLINK-9033
 URL: https://issues.apache.org/jira/browse/FLINK-9033
 Project: Flink
  Issue Type: Improvement
  Components: Configuration
Affects Versions: 1.5.0
Reporter: Hai Zhou
Assignee: Hai Zhou
 Fix For: 1.5.0


The deprecated ConfigConstants#TASK_MANAGER_NUM_TASK_SLOTS is still used a lot.

We should replace these usages with TaskManagerOptions#NUM_TASK_SLOTS.
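For illustration, the difference between a bare string constant and a typed option can be sketched in plain Java (the `Option` class below is a made-up stand-in for Flink's `ConfigOption`; only the key string matches the real `TaskManagerOptions#NUM_TASK_SLOTS`):

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigOptionSketch {
    // A minimal stand-in for a typed config option: the key and its
    // typed default live together, instead of a bare string constant
    // with the default scattered elsewhere.
    static final class Option<T> {
        final String key;
        final T defaultValue;
        Option(String key, T defaultValue) {
            this.key = key;
            this.defaultValue = defaultValue;
        }
    }

    static final Option<Integer> NUM_TASK_SLOTS =
            new Option<>("taskmanager.numberOfTaskSlots", 1);

    static int getInteger(Map<String, String> conf, Option<Integer> opt) {
        String raw = conf.get(opt.key);
        return raw == null ? opt.defaultValue : Integer.parseInt(raw);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(getInteger(conf, NUM_TASK_SLOTS)); // 1 (default)
        conf.put(NUM_TASK_SLOTS.key, "4");
        System.out.println(getInteger(conf, NUM_TASK_SLOTS)); // 4
    }
}
```

Keeping key and default together in one typed object is what makes the options preferable to the deprecated constants.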



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: FLIP-6 / Kubernetes

2018-03-20 Thread Eron Wright
 Till, is it possible to package a Flink application as a self-contained
deployment on Kubernetes?  I mean, run a Flink application using 'flink
run' such that it launches its own RM/JM and waits for a sufficient # of
TMs to join?

Thanks!

On Mon, Mar 19, 2018 at 2:57 AM, Till Rohrmann  wrote:

> I forgot to add Flink's K8 documentation [1] which might also be helpful
> with getting started.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/kubernetes.html
>
> Cheers,
> Till
>
> On Mon, Mar 19, 2018 at 10:54 AM, Till Rohrmann 
> wrote:
>
> > Hi Thomas,
> >
> > I think one way to get started would be to adapt the Flink Docker
> > images [1,2] to run with Flink 1.5. By default, they will use the Flip-6
> > components.
> >
> > Flink 1.5 won't come with a dedicated integration with Kubernetes which is
> > able to start new pods. However, it should work that you manually, or by
> > virtue of an external system, start new pods which can be used. Once the
> > new pods have been started and the TaskExecutors have registered, one
> > would have to rescale the job manually to the new parallelism.
> >
> > [1] https://flink.apache.org/news/2017/05/16/official-docker-image.html
> > [2] https://hub.docker.com/r/_/flink/
> >
> > Cheers,
> > Till
> >
> > On Sun, Mar 18, 2018 at 9:05 PM, Thomas Weise  wrote:
> >
> >> Hi,
> >>
> >> What would be a good starting point to try out Flink on Kubernetes (any
> >> examples / tutorials)?
> >>
> >> Also, will the FLIP-6 work in 1.5 enable dynamic scaling on Kubernetes?
> >>
> >> Thanks,
> >> Thomas
> >>
> >
> >
>


[jira] [Created] (FLINK-9032) Deprecate ConfigConstants#YARN_CONTAINER_START_COMMAND_TEMPLATE

2018-03-20 Thread Hai Zhou (JIRA)
Hai Zhou created FLINK-9032:
---

 Summary: Deprecate 
ConfigConstants#YARN_CONTAINER_START_COMMAND_TEMPLATE
 Key: FLINK-9032
 URL: https://issues.apache.org/jira/browse/FLINK-9032
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.5.0
Reporter: Hai Zhou
Assignee: Hai Zhou
 Fix For: 1.5.0, 1.6.0


This constant ("yarn.container-start-command-template") has disappeared from the 
[1.5.0-SNAPSHOT 
docs|https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html].

We should restore it, and I think it should be renamed 
"containerized.start-command-template".

[~Zentol], what do you think?




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9030) JobManager fails to archive job to FS when TM is lost

2018-03-20 Thread Jared Stehler (JIRA)
Jared Stehler created FLINK-9030:


 Summary: JobManager fails to archive job to FS when TM is lost
 Key: FLINK-9030
 URL: https://issues.apache.org/jira/browse/FLINK-9030
 Project: Flink
  Issue Type: Bug
  Components: History Server, JobManager, Mesos
Affects Versions: 1.4.0
Reporter: Jared Stehler


We are running Flink on Mesos, and are finding that when a job fails due to a 
task manager getting lost (from an OOM kill), the job isn't archived properly 
into the history server directory on the filesystem. 

When this happens, the job does appear in the finished listing in the job 
manager's in-memory archive view, and is accessible in the running job 
manager's rest api, but obviously not in the history server's rest api.

This is causing us issues as we are using the history server as a system of 
record for canceled or failed jobs in order to determine previous savepoint / 
external checkpoints.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9029) Getting write permission from HDFS after updating flink-1.40 to flink-1.4.2

2018-03-20 Thread Mohammad Abareghi (JIRA)
Mohammad Abareghi created FLINK-9029:


 Summary: Getting write permission from HDFS after updating 
flink-1.40 to flink-1.4.2
 Key: FLINK-9029
 URL: https://issues.apache.org/jira/browse/FLINK-9029
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.4.2, 1.4.1
 Environment: * Flink-1.4.2 (Flink-1.4.1)
 * Hadoop 2.6.0-cdh5.13.0 with 4 nodes in service and {{Security is off.}}
 * Ubuntu 16.04.3 LTS
 * Java 8
Reporter: Mohammad Abareghi


*Environment*
 * Flink-1.4.2
 * Hadoop 2.6.0-cdh5.13.0 with 4 nodes in service and {{Security is off.}}
 * Ubuntu 16.04.3 LTS
 * Java 8

*Description*

I have a Java job in flink-1.4.0 which writes to HDFS in a specific path. After 
updating to flink-1.4.2 I'm getting the following error from Hadoop complaining 
that the user doesn't have write permission to the given path:

 

{code:java}
WARN org.apache.hadoop.security.UserGroupInformation: 
PriviledgedActionException as:xng (auth:SIMPLE) 
cause:org.apache.hadoop.security.AccessControlException: Permission denied: 
user=user1, access=WRITE, inode="/user":hdfs:hadoop:drwxr-xr-x
{code}
*NOTE*:
 * If I run the same job on flink-1.4.0, the error disappears regardless of 
which version of the Flink (1.4.0 or 1.4.2) dependencies I have for the job.
 * Also, if I run the job's main method from my IDE and pass the same 
parameters, I don't get the above error.

*NOTE*:

It seems the problem somehow is in 
{{flink-1.4.2/lib/flink-shaded-hadoop2-uber-1.4.2.jar}}. If I replace that with 
{{flink-1.4.0/lib/flink-shaded-hadoop2-uber-1.4.0.jar}}, restart the cluster 
and run my job (flink topology) then the error doesn't appear.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [ANNOUNCE] Apache Flink 1.3.3 released

2018-03-20 Thread Chesnay Schepler

Whoops, looks like we forgot to push the release button :)

Thank you for notifying us. The artifacts should be available soon.

On 20.03.2018 11:35, Philip Luppens wrote:

Hi everyone,

Thanks, but I don’t see the binaries for 1.3.3 being pushed anywhere in the
Maven repositories [1]. Can we expect them to show up over there as well
eventually?

[1] https://repo.maven.apache.org/maven2/org/apache/flink/flink-java/

Kind regards,

-Phil


On Fri, Mar 16, 2018 at 3:36 PM, Stephan Ewen  wrote:


This release fixed a quite critical bug that could lead to loss of
checkpoint state: https://issues.apache.org/jira/browse/FLINK-7783

We recommend all users on Flink 1.3.2 to upgrade to 1.3.3


On Fri, Mar 16, 2018 at 10:31 AM, Till Rohrmann 
wrote:


Thanks for managing the release Gordon and also thanks to the community!

Cheers,
Till

On Fri, Mar 16, 2018 at 9:05 AM, Fabian Hueske 
wrote:


Thanks for managing this release Gordon!

Cheers, Fabian

2018-03-15 21:24 GMT+01:00 Tzu-Li (Gordon) Tai :


The Apache Flink community is very happy to announce the release of
Apache Flink 1.3.3, which is the third bugfix release for the Apache Flink
1.3 series.

Apache Flink® is an open-source stream processing framework for
distributed, high-performing, always-available, and accurate data streaming
applications.

The release is available for download at:
https://flink.apache.org/downloads.html

Please check out the release blog post for an overview of the
improvements for this bugfix release:
https://flink.apache.org/news/2018/03/15/release-1.3.3.html

The full release notes are available in Jira:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12341142

We would like to thank all contributors of the Apache Flink community
who made this release possible!

Cheers,
Gordon








Re: [ANNOUNCE] Apache Flink 1.3.3 released

2018-03-20 Thread Philip Luppens
Hi everyone,

Thanks, but I don’t see the binaries for 1.3.3 being pushed anywhere in the
Maven repositories [1]. Can we expect them to show up over there as well
eventually?

[1] https://repo.maven.apache.org/maven2/org/apache/flink/flink-java/

Kind regards,

-Phil


On Fri, Mar 16, 2018 at 3:36 PM, Stephan Ewen  wrote:

> This release fixed a quite critical bug that could lead to loss of
> checkpoint state: https://issues.apache.org/jira/browse/FLINK-7783
>
> We recommend all users on Flink 1.3.2 to upgrade to 1.3.3
>
>
> On Fri, Mar 16, 2018 at 10:31 AM, Till Rohrmann 
> wrote:
>
>> Thanks for managing the release Gordon and also thanks to the community!
>>
>> Cheers,
>> Till
>>
>> On Fri, Mar 16, 2018 at 9:05 AM, Fabian Hueske 
>> wrote:
>>
>>> Thanks for managing this release Gordon!
>>>
>>> Cheers, Fabian
>>>
>>> 2018-03-15 21:24 GMT+01:00 Tzu-Li (Gordon) Tai :
>>>
 The Apache Flink community is very happy to announce the release of
 Apache Flink 1.3.3, which is the third bugfix release for the Apache Flink
 1.3 series.

 Apache Flink® is an open-source stream processing framework for
 distributed, high-performing, always-available, and accurate data streaming
 applications.

 The release is available for download at:
 https://flink.apache.org/downloads.html

 Please check out the release blog post for an overview of the
 improvements for this bugfix release:
 https://flink.apache.org/news/2018/03/15/release-1.3.3.html

 The full release notes are available in Jira:
 https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12341142

 We would like to thank all contributors of the Apache Flink community
 who made this release possible!

 Cheers,
 Gordon


>>>
>>
>


-- 
"We cannot change the cards we are dealt, just how we play the hand." -
Randy Pausch


[jira] [Created] (FLINK-9028) flip6 should check config before starting cluster

2018-03-20 Thread Sihua Zhou (JIRA)
Sihua Zhou created FLINK-9028:
-

 Summary: flip6 should check config before starting cluster
 Key: FLINK-9028
 URL: https://issues.apache.org/jira/browse/FLINK-9028
 Project: Flink
  Issue Type: Bug
  Components: Distributed Coordination
Affects Versions: 1.5.0
Reporter: Sihua Zhou
Assignee: Sihua Zhou
 Fix For: 1.5.0


In Flip-6, we should perform parameter checks before starting the cluster.
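In plain Java, the shape of such an up-front check might look like this (a hypothetical sketch, not the Flip-6 code; the config key is only an example):

```java
import java.util.Map;

public class ConfigCheck {
    // Validate parameters up front so a misconfigured cluster fails
    // immediately with a clear message, instead of failing later
    // somewhere deep inside a running component.
    static void checkConfig(Map<String, String> conf) {
        int slots = Integer.parseInt(
                conf.getOrDefault("taskmanager.numberOfTaskSlots", "1"));
        if (slots < 1) {
            throw new IllegalArgumentException(
                    "taskmanager.numberOfTaskSlots must be >= 1, was " + slots);
        }
    }

    public static void main(String[] args) {
        Map<String, String> bad = Map.of("taskmanager.numberOfTaskSlots", "0");
        try {
            checkConfig(bad);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected before startup: " + e.getMessage());
        }
    }
}
```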



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)