Re: Are timers in ProcessFunction fault tolerant?

2017-05-25 Thread Moiz S Jinia
A follow-on question. Since the registered timers are part of the managed
keyed state, do the timers get cancelled when I call state.clear()?

Moiz

On Thu, May 25, 2017 at 10:20 PM, Moiz S Jinia  wrote:

> Awesome. Thanks.
>
> On Thu, May 25, 2017 at 10:13 PM, Eron Wright 
> wrote:
>
>> Yes, registered timers are stored in managed keyed state and should be
>> fault-tolerant.
>>
>> -Eron
>>
>> On Thu, May 25, 2017 at 9:28 AM, Moiz S Jinia 
>> wrote:
>>
>>> With a checkpointed RocksDB based state backend, can I expect the
>>> registered processing timers to be fault tolerant? (along with the managed
>>> keyed state).
>>>
>>> Example -
>>> A task manager instance owns the key k1 (from a keyed stream) that has
> registered a processing timer with a timestamp that's a day ahead in the
>>> future. If this instance is killed, and the key is moved to another
>>> instance, will the onTimer trigger correctly on the other machine at the
>>> expected time with the same keyed state (for k1)?
>>>
>>> Thanks,
>>> Moiz
>>>
>>
>>
>
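Moiz's follow-on question (whether state.clear() cancels timers) is not answered in this thread. In Flink, the timer service tracks timers separately from the value state a ProcessFunction clears, so clearing state does not by itself remove pending timers. A toy model of that separation in plain Java (not Flink API; the class and field names are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Toy model of one key's scope: value state and registered timers are
// held in separate structures, mirroring how Flink's timer service is
// independent of the keyed state that state.clear() wipes.
public class KeyedScopeModel {
    private final Map<String, String> valueState = new HashMap<>();
    private final TreeSet<Long> timers = new TreeSet<>();

    public void update(String name, String value) { valueState.put(name, value); }
    public void registerTimer(long timestamp) { timers.add(timestamp); }

    // Analogous to state.clear(): wipes the value state only.
    public void clearState() { valueState.clear(); }

    public int stateSize() { return valueState.size(); }
    public int pendingTimers() { return timers.size(); }

    public static void main(String[] args) {
        KeyedScopeModel k1 = new KeyedScopeModel();
        k1.update("count", "42");
        k1.registerTimer(1_000_000L);
        k1.clearState();
        // State is gone, but the timer is still pending and would still fire.
        System.out.println(k1.stateSize() + " " + k1.pendingTimers()); // prints "0 1"
    }
}
```

So a timer registered before state.clear() would still fire, and the onTimer callback would see empty state for that key.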


invalid type code: 00

2017-05-25 Thread rhashmi
Sporadically I am seeing this error. Any idea?


java.lang.IllegalStateException: Could not initialize keyed state backend.
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initKeyedState(AbstractStreamOperator.java:286)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:199)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:664)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:651)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:257)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:655)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.StreamCorruptedException: invalid type code: 00
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1381)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
    at java.util.HashMap.readObject(HashMap.java:1404)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1909)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
    at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:503)
    at org.apache.flink.api.java.typeutils.runtime.PojoSerializer.readObject(PojoSerializer.java:130)
    at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1909)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
    at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:292)
    at org.apache.flink.api.common.typeutils.TypeSerializerSerializationProxy.read(TypeSerializerSerializationProxy.java:97)
    at org.apache.flink.runtime.state.KeyedBackendSerializationProxy.read(KeyedBackendSerializationProxy.java:88)
    at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend.restorePartitionedState(HeapKeyedStateBackend.java:299)
    at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend.restore(HeapKeyedStateBackend.java:243)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.createKeyedStateBackend(StreamTask.java:799)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initKeyedState(AbstractStreamOperator.java:277)
    ... 6 more



--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/invalid-type-code-00-tp13326.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


Flink parallel tasks, slots and vcores

2017-05-25 Thread Sathi Chowdhury
Hi Till/ flink-devs,
I am trying to understand why adding slots in the task manager has no
impact on performance for the test pipeline.
Here is my flink-conf.yaml
jobmanager.rpc.address: localhost
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.memory.preallocate: false
taskmanager.numberOfTaskSlots: 8
parallelism.default: 8
akka.ask.timeout: 1 s
akka.lookup.timeout: 100 s

I have an EMR cluster of 8 task managers… for now I wanted to use one
task manager with multiple slots.
So I start the cluster with
$FLINK_HOME/bin/yarn-session.sh -d -n 1 -s 8 -tm 57344

Then I run my Flink job with -p 8.
I see only 1 CPU core being used in this task manager.

In the YARN web UI, I see the node chosen to be the task manager (as I gave
-n 1): ip-10-202-4-14.us-west-2.compute.internal:8041
Under memory avail: 56 GB
Under vcores used: 1
Under vcores avail: 31

I was expecting vcores used to be 8.

Any clue or hint why I am seeing this? Also, I am not seeing any performance
gain (using the Flink Kinesis connector to read a JSON file and sink to S3);
it is getting capped at the same performance I have seen with one slot.
Thanks
Sathi
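One thing worth checking here: the YARN UI's "vcores used" reflects what was *requested* for the container, not actual CPU usage, and by default Flink requests a single vcore per container. A hedged sketch of the relevant setting, assuming the `yarn.containers.vcores` option exists in the Flink release in use (check the YARN configuration docs for your version):

```yaml
# flink-conf.yaml — sketch, assuming the yarn.containers.vcores option
# is available in your Flink release. It raises the number of vcores
# requested per YARN container (default is 1), so the YARN UI would
# report 8 vcores used instead of 1.
yarn.containers.vcores: 8
taskmanager.numberOfTaskSlots: 8
```

Note also that with YARN's default DefaultResourceCalculator only memory is enforced, so a low "vcores used" number does not by itself cap CPU; if only one core is actually busy, the bottleneck is more likely in the job (e.g. source parallelism) than in the vcore request.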



Re: Collapsible job plan visualization

2017-05-25 Thread Chesnay Schepler
You should be able to move the separator between the plan view and the 
bottom panel already.


On 25.05.2017 19:45, Flavio Pompermaier wrote:

Hi to all,
In our experience the Flink plan diagram is a nice feature, but it is 
useless almost all the time and it has an annoying interaction with 
the mouse wheel... I suggest making it a collapsible div. IMHO that 
would be an easy change that would definitely improve the user 
experience... what do other Flinkers think about this?


Best,
Flavio





Collapsible job plan visualization

2017-05-25 Thread Flavio Pompermaier
Hi to all,
In our experience the Flink plan diagram is a nice feature, but it is
useless almost all the time and it has an annoying interaction with the
mouse wheel... I suggest making it a collapsible div. IMHO that would be
an easy change that would definitely improve the user experience... what
do other Flinkers think about this?

Best,
Flavio


Re: Flink and swapping question

2017-05-25 Thread Flavio Pompermaier
Hi to all,
I think we found the root cause of all the problems. Looking at dmesg there
was a "crazy" total-vm size associated with the OOM error, a LOT bigger
than the TaskManager's available memory.
In our case, the TM had a max heap of 14 GB while the dmesg error was
reporting a required amount of memory in the order of 60 GB!

[ 5331.992539] Out of memory: Kill process 24221 (java) score 937 or
sacrifice child
[ 5331.992619] Killed process 24221 (java) *total-vm:64800680kB*,
anon-rss:31387544kB, file-rss:6064kB, shmem-rss:0kB

That wouldn't normally be possible with an ordinary JVM (and our TM was
running without off-heap settings), so we looked at the parameters used
to run the TM JVM, and indeed there was a really huge amount of memory given
to MaxDirectMemorySize. To my big surprise, Flink runs a TM with this
parameter set to 8.388.607T... does that make any sense?
Is the importance of this parameter documented anywhere (and why it is
used in non-off-heap mode as well)? Is it related to network buffers?
It should also be documented that this parameter should be added to the TM
heap when reserving memory for Flink (IMHO).

I hope these painful sessions of Flink troubleshooting can provide some
added value sooner or later..

Best,
Flavio
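For anyone reproducing this: the gap between -Xmx and total-vm can come from off-heap allocations, which are bounded by -XX:MaxDirectMemorySize rather than the heap size. A minimal standalone illustration in plain Java (the 16 MiB size is arbitrary):

```java
import java.nio.ByteBuffer;

// Direct (off-heap) buffers do not live in the JVM heap; their total is
// capped by -XX:MaxDirectMemorySize, not -Xmx. A huge value for that
// flag (like the 8.388.607T mentioned above) effectively removes the
// cap, which is one way a process's total-vm can dwarf its configured heap.
public class DirectMemoryDemo {
    public static ByteBuffer allocateOffHeap(int bytes) {
        return ByteBuffer.allocateDirect(bytes);
    }

    public static void main(String[] args) {
        ByteBuffer buf = allocateOffHeap(16 * 1024 * 1024); // 16 MiB, arbitrary
        System.out.println(buf.isDirect());  // prints true
        System.out.println(buf.capacity());  // prints 16777216
    }
}
```

This is why, when sizing a machine or container, the direct-memory budget has to be added on top of the heap, as the message above argues.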

On Thu, May 25, 2017 at 10:21 AM, Flavio Pompermaier 
wrote:

> I can confirm that after giving less memory to the Flink TM the job was
> able to run successfully.
> After almost 2 weeks of pain, we summarize here our experience with Flink
> in virtualized environments (such as VMWare ESXi):
>
>1. Disable the virtualization "feature" that transfer a VM from a
>(heavy loaded) physical machine to another one (to balance the resource
>consumption)
>2. Check dmesg when a TM dies without logging anything (usually it
>goes OOM and the OS kills it but there you can find the log of this thing)
>3. CentOS 7 on ESXi seems to start swapping VERY early (in my case I
>see the OS starting swapping also if there are 12 out of 32 GB of free
>memory)!
>
> We're still investigating how this behavior could be fixed: the problem is
> that it's better not to disable swapping because otherwise VMWare could
> start ballooning (that is definitely worse...).
>
> I hope these tips could save someone else's day..
>
> Best,
> Flavio
>
> On Wed, May 24, 2017 at 4:28 PM, Flavio Pompermaier 
> wrote:
>
>> Hi Greg, you were right! After typing dmesg I found "Out of memory: Kill
>> process 13574 (java)".
>> This is really strange because the JVM of the TM is very calm.
>> Moreover, there are 7 GB of memory available (out of 32) but somehow the
>> OS decides to start swapping and, when it runs out of available swap
>> memory, the OS decides to kill the Flink TM :(
>>
>> Any idea of what's going on here?
>>
>> On Wed, May 24, 2017 at 2:32 PM, Flavio Pompermaier > > wrote:
>>
>>> Hi Greg,
>>> I carefully monitored all TM memory with jstat -gcutil and there's no full
>>> gc, only .
>>> The initial situation on the dying TM is:
>>>
>>>   S0     S1      E      O      M     CCS    YGC   YGCT   FGC  FGCT   GCT
>>>   0.00  100.00  33.57  88.74  98.42  97.17  159   2.508  1    0.255  2.763
>>>   0.00  100.00  90.14  88.80  98.67  97.17  197   2.617  1    0.255  2.873
>>>   0.00  100.00  27.00  88.82  98.75  97.17  234   2.730  1    0.255  2.986
>>>
>>> After about 10 hours of processing it is:
>>>
>>>   0.00  100.00  21.74  83.66  98.52  96.94  5519  33.011  1    0.255  33.267
>>>   0.00  100.00  21.74  83.66  98.52  96.94  5519  33.011  1    0.255  33.267
>>>   0.00  100.00  21.74  83.66  98.52  96.94  5519  33.011  1    0.255  33.267
>>>
>>> So I don't think that OOM could be an option.
>>>
>>> However, the cluster is running on ESXi vSphere VMs and we already
>>> experienced unexpected crash of jobs because of ESXi moving a heavy-loaded
>>> VM to another (less loaded) physical machine..I would't be surprised if
>>> swapping is also handled somehow differently..
>>> Looking at Cloudera widgets I see that the crash is usually preceded by
>>> an intense cpu_iowait period.
>>> I fear that Flink unsafe access to memory could be a problem in those
>>> scenarios. Am I wrong?
>>>
>>> Any insight or debugging technique is  greatly appreciated.
>>> Best,
>>> Flavio
>>>
>>>
>>> On Wed, May 24, 2017 at 2:11 PM, Greg Hogan  wrote:
>>>
 Hi Flavio,

 Flink handles interrupts so the only silent killer I am aware of is
 Linux's OOM killer. Are you seeing such a message in dmesg?

 Greg

 On Wed, May 24, 2017 at 3:18 AM, Flavio Pompermaier <
 pomperma...@okkam.it> wrote:

> Hi to all,
> I'd like to know whether memory swapping could cause a taskmanager
> crash.
> In my cluster of virtual machines I'm seeing this strange behavior in
> my Flink cluster: sometimes, if memory gets swapped the 

Re: Are timers in ProcessFunction fault tolerant?

2017-05-25 Thread Moiz S Jinia
Awesome. Thanks.

On Thu, May 25, 2017 at 10:13 PM, Eron Wright  wrote:

> Yes, registered timers are stored in managed keyed state and should be
> fault-tolerant.
>
> -Eron
>
> On Thu, May 25, 2017 at 9:28 AM, Moiz S Jinia 
> wrote:
>
>> With a checkpointed RocksDB based state backend, can I expect the
>> registered processing timers to be fault tolerant? (along with the managed
>> keyed state).
>>
>> Example -
>> A task manager instance owns the key k1 (from a keyed stream) that has
>> registered a processing timer with a timestamp that's a day ahead in the
>> future. If this instance is killed, and the key is moved to another
>> instance, will the onTimer trigger correctly on the other machine at the
>> expected time with the same keyed state (for k1)?
>>
>> Thanks,
>> Moiz
>>
>
>


Re: Are timers in ProcessFunction fault tolerant?

2017-05-25 Thread Tzu-Li (Gordon) Tai
Hi Moiz!

Adding a bit of more detail here:
Yes, the timer will be restored on whatever new instance is responsible for 
that key.
There is one “gotcha” to look out for, though: the firing times of timers are 
absolute. What this means is that if the checkpointed timer’s firing processing 
timestamp is t (which is basically the registration time + configured trigger 
time), then it will also fire at processing timestamp t on the new instance. 
Therefore, you should be aware of out-of-sync clocks between the 2 instances.

Another thing to note is that if the job isn’t running at t (when the timer is 
supposed to fire), then on restore, that timer is fired immediately.

Cheers,
Gordon
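Gordon's two points (firing times are absolute, and past-due timers fire immediately on restore) can be sketched with a tiny standalone model (plain Java, not Flink API; the method names are invented for illustration):

```java
// Toy model of restoring a checkpointed processing-time timer.
// Firing times are absolute: a timer checkpointed to fire at time t
// fires at t on the new instance, or immediately if t already passed
// while the job was down.
public class TimerRestore {
    public static long effectiveFireTime(long checkpointedFireTime, long restoreTime) {
        return Math.max(checkpointedFireTime, restoreTime);
    }

    public static void main(String[] args) {
        // Restored before the deadline: fires at the original absolute time.
        System.out.println(effectiveFireTime(1_000L, 400L));   // prints 1000
        // Restored after the deadline: fires immediately on restore.
        System.out.println(effectiveFireTime(1_000L, 5_000L)); // prints 5000
    }
}
```

Because t is absolute, any clock skew between the old and new instance shifts the effective trigger by the same amount, which is the out-of-sync-clocks caveat Gordon raises.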

On 26 May 2017 at 12:44:00 AM, Eron Wright (eronwri...@gmail.com) wrote:

Yes, registered timers are stored in managed keyed state and should be 
fault-tolerant. 

-Eron

On Thu, May 25, 2017 at 9:28 AM, Moiz S Jinia  wrote:
With a checkpointed RocksDB based state backend, can I expect the registered 
processing timers to be fault tolerant? (along with the managed keyed state).

Example -
A task manager instance owns the key k1 (from a keyed stream) that has 
registered a processing timer with a timestamp that's a day ahead in the future. 
If this instance is killed, and the key is moved to another instance, will the 
onTimer trigger correctly on the other machine at the expected time with the 
same keyed state (for k1)?

Thanks,
Moiz



Re: Are timers in ProcessFunction fault tolerant?

2017-05-25 Thread Eron Wright
Yes, registered timers are stored in managed keyed state and should be
fault-tolerant.

-Eron

On Thu, May 25, 2017 at 9:28 AM, Moiz S Jinia  wrote:

> With a checkpointed RocksDB based state backend, can I expect the
> registered processing timers to be fault tolerant? (along with the managed
> keyed state).
>
> Example -
> A task manager instance owns the key k1 (from a keyed stream) that has
> registered a processing timer with a timestamp that's a day ahead in the
> future. If this instance is killed, and the key is moved to another
> instance, will the onTimer trigger correctly on the other machine at the
> expected time with the same keyed state (for k1)?
>
> Thanks,
> Moiz
>


Are timers in ProcessFunction fault tolerant?

2017-05-25 Thread Moiz S Jinia
With a checkpointed RocksDB based state backend, can I expect the
registered processing timers to be fault tolerant? (along with the managed
keyed state).

Example -
A task manager instance owns the key k1 (from a keyed stream) that has
registered a processing timer with a timestamp that's a day ahead in the
future. If this instance is killed, and the key is moved to another
instance, will the onTimer trigger correctly on the other machine at the
expected time with the same keyed state (for k1)?

Thanks,
Moiz


Re: Kafka 0.10 jaas multiple clients

2017-05-25 Thread Eron Wright
Gordon's suggestion seems like a good way to provide per-job credentials
based on application-specific properties.   In contrast, Flink's built-in
JAAS features are aimed at making the Flink cluster's Kerberos credentials
available to jobs.

I want to reiterate that all jobs (for a given Flink cluster) run with full
privilege in the JVM, and that Flink does not guarantee isolation (of
security information) between jobs.   My suggestion is to not run untrusted
job code in a shared Flink cluster.

On Wed, May 24, 2017 at 8:46 PM, Tzu-Li (Gordon) Tai 
wrote:

> Hi Gwenhael,
>
> Follow-up for this:
>
> Turns out what you require is already available with Kafka 0.10, using
> dynamic JAAS configurations [1] instead of a static JAAS file like what
> you’re currently doing.
>
> The main thing to do is to set a “sasl.jaas.config” in the config
> properties for your individual Kafka consumer / producer.
> This will override any static JAAS configuration used.
> Note 2 things here: 1) static JAAS configurations are a JVM process-wide
> installation, meaning that any separate Kafka client within the same
> process can only ever share the same credentials, and 2) the “KafkaClient”
> is a fixed JAAS lookup section key that the Kafka clients use, which I
> don’t think is modifiable. So using the static config approach would never
> work.
>
> An example “sasl.jaas.config” for plain logins:
> "org.apache.kafka.common.security.plain.PlainLoginModule required
> username= password=
>
> Simply have different values for each of the Kafka consumer / producers
> you’re using.
>
> Cheers,
> Gordon
>
>
> On 8 May 2017 at 4:42:07 PM, Tzu-Li (Gordon) Tai (tzuli...@apache.org)
> wrote:
>
> Hi Gwenhael,
>
> Sorry for the very long delayed response on this.
>
> As you noticed, the “KafkaClient” entry name seems to be a hardcoded thing
> on the Kafka side, so currently I don’t think what you’re asking for is
> possible.
>
> It seems like this could be made possible with some of the new
> authentication features in Kafka 0.10 that seems related: [1] [2].
>
> I’m not that deep into the authentication modules, but I’ll take a look
> and can keep you posted on this.
> Also looping in Eron (in CC) who could perhaps provide more insight on
> this at the same time.
>
> Cheers,
> Gordon
>
> [1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 83+-+Allow+multiple+SASL+authenticated+Java+clients+in+
> a+single+JVM+process
> [2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 85%3A+Dynamic+JAAS+configuration+for+Kafka+clients
>
> On 26 April 2017 at 8:48:20 PM, Gwenhael Pasquiers (
> gwenhael.pasqui...@ericsson.com) wrote:
>
> Hello,
>
> Up to now we’ve been using kafka with jaas (plain login/password) the
> following way:
>
> -  yarnship the jaas file
>
> -  add the jaas file name into “flink-conf.yaml” using property
> “env.java.opts”
>
>
>
> How can we support multiple secured Kafka 0.10 consumers and producers (with
> different logins and passwords, of course)?
>
> From what I saw in the kafka sources, the entry name “KafkaClient” is
> hardcoded…
>
> Best Regards,
>
>
>
> Gwenhaël PASQUIERS
>
>
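Gordon's per-client "sasl.jaas.config" suggestion can be sketched as plain java.util.Properties (the usernames and passwords are placeholders; in a real job each Properties object would be passed to its own Kafka consumer or producer constructor):

```java
import java.util.Properties;

// Sketch of per-client dynamic JAAS configuration via the
// "sasl.jaas.config" property, as described in the thread above.
// Credentials are placeholders; give each Kafka consumer/producer its
// own Properties so the clients authenticate with different logins,
// overriding any static JAAS file.
public class PerClientJaas {
    public static Properties withPlainLogin(String username, String password) {
        Properties props = new Properties();
        props.setProperty("security.protocol", "SASL_PLAINTEXT"); // example choice
        props.setProperty("sasl.mechanism", "PLAIN");
        props.setProperty("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                        + "username=\"" + username + "\" "
                        + "password=\"" + password + "\";");
        return props;
    }

    public static void main(String[] args) {
        Properties consumerA = withPlainLogin("clientA", "secretA");
        Properties consumerB = withPlainLogin("clientB", "secretB");
        System.out.println(consumerA.getProperty("sasl.jaas.config"));
        System.out.println(consumerB.getProperty("sasl.jaas.config"));
    }
}
```

This avoids the fixed "KafkaClient" lookup section entirely, which is why the static-file approach could never support two different logins in one JVM.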


Re: How can I handle backpressure with event time.

2017-05-25 Thread Eron Wright
Try setting the assigner on the Kafka consumer, rather than on the
DataStream:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/connectors/
kafka.html#kafka-consumers-and-timestamp-extractionwatermark-emission

I believe this will produce a per-partition assigner and forward only the
minimum watermark across all partitions.

Hope this helps,
-Eron
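The per-partition behaviour Eron describes (the watermark forwarded downstream is the minimum across partitions) can be illustrated with a small standalone sketch (plain Java, not the Flink API):

```java
// Sketch of per-partition watermarking: each Kafka partition tracks its
// own watermark, and the watermark forwarded downstream is the minimum
// across partitions, so one delayed partition holds back event time for
// the whole consumer instead of letting timers fire past its data.
public class PartitionWatermarks {
    public static long overallWatermark(long[] perPartitionWatermarks) {
        long min = Long.MAX_VALUE;
        for (long w : perPartitionWatermarks) {
            min = Math.min(min, w);
        }
        return min;
    }

    public static void main(String[] args) {
        // Partitions at event times 10_000, 3_000, 7_000: downstream sees 3_000.
        System.out.println(overallWatermark(new long[]{10_000L, 3_000L, 7_000L})); // prints 3000
    }
}
```

This is why assigning the watermark extractor on the consumer (per partition) rather than on the merged DataStream keeps a backpressured partition from being timed out by faster ones.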

On Thu, May 25, 2017 at 3:21 AM, yunfan123 
wrote:

> For example, I want to merge two kafka topics (named topicA and topicB) by
> the specific key with a max timeout.
> I use event time and class BoundedOutOfOrdernessTimestampExtractor to
> generate water mark.
> When some partitions of topicA are delayed by backpressure, and the delay
> exceeds my max timeout,
> the result is that none of my delayed partitions in topicA (nor the
> corresponding
> data in topicB) can be merged.
> What I want is that if backpressure happens, consumers consume based only on
> my event time.
>
>
>
> --
> View this message in context: http://apache-flink-user-
> mailing-list-archive.2336050.n4.nabble.com/How-can-I-
> handle-backpressure-with-event-time-tp13313.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>


New "Powered by Flink" success case

2017-05-25 Thread Rosellini, Luca
Hello everybody,
I am posting this here following the guidelines I've found in the "Powered
by Flink" page.

At KEEDIO (http://www.keedio.com) we use Apache Flink CEP API in a log
aggregation solution 
for Red Hat OpenStack that enables us to discover operational anomalies
that otherwise would be very hard to spot.

We think it would be interesting to add this use case in the "Powered by
Flink" page.

Let me know if you need further information.

Thanks in advance,

*KEEDIO*

*Luca Rosellini*

*+34 667 24 38 57*

*www.keedio.com*

C/ Virgilio 25, Pozuelo de Alarcón


Re: Flink and swapping question

2017-05-25 Thread Flavio Pompermaier
I can confirm that after giving less memory to the Flink TM the job was
able to run successfully.
After almost 2 weeks of pain, we summarize here our experience with Flink in
virtualized environments (such as VMWare ESXi):

   1. Disable the virtualization "feature" that transfer a VM from a (heavy
   loaded) physical machine to another one (to balance the resource
   consumption)
   2. Check dmesg when a TM dies without logging anything (usually it goes
   OOM and the OS kills it but there you can find the log of this thing)
   3. CentOS 7 on ESXi seems to start swapping VERY early (in my case I see
   the OS starting swapping also if there are 12 out of 32 GB of free memory)!

We're still investigating how this behavior could be fixed: the problem is
that it's better not to disable swapping because otherwise VMWare could
start ballooning (that is definitely worse...).

I hope these tips could save someone else's day..

Best,
Flavio

On Wed, May 24, 2017 at 4:28 PM, Flavio Pompermaier 
wrote:

> Hi Greg, you were right! After typing dmesg I found "Out of memory: Kill
> process 13574 (java)".
> This is really strange because the JVM of the TM is very calm.
> Moreover, there are 7 GB of memory available (out of 32) but somehow the
> OS decides to start swapping and, when it runs out of available swap
> memory, the OS decides to kill the Flink TM :(
>
> Any idea of what's going on here?
>
> On Wed, May 24, 2017 at 2:32 PM, Flavio Pompermaier 
> wrote:
>
>> Hi Greg,
>> I carefully monitored all TM memory with jstat -gcutil and there's no full
>> gc, only .
>> The initial situation on the dying TM is:
>>
>>   S0     S1      E      O      M     CCS    YGC   YGCT   FGC  FGCT   GCT
>>   0.00  100.00  33.57  88.74  98.42  97.17  159   2.508  1    0.255  2.763
>>   0.00  100.00  90.14  88.80  98.67  97.17  197   2.617  1    0.255  2.873
>>   0.00  100.00  27.00  88.82  98.75  97.17  234   2.730  1    0.255  2.986
>>
>> After about 10 hours of processing it is:
>>
>>   0.00  100.00  21.74  83.66  98.52  96.94  5519  33.011  1    0.255  33.267
>>   0.00  100.00  21.74  83.66  98.52  96.94  5519  33.011  1    0.255  33.267
>>   0.00  100.00  21.74  83.66  98.52  96.94  5519  33.011  1    0.255  33.267
>>
>> So I don't think that OOM could be an option.
>>
>> However, the cluster is running on ESXi vSphere VMs and we already
>> experienced unexpected crash of jobs because of ESXi moving a heavy-loaded
>> VM to another (less loaded) physical machine..I would't be surprised if
>> swapping is also handled somehow differently..
>> Looking at Cloudera widgets I see that the crash is usually preceded by
>> an intense cpu_iowait period.
>> I fear that Flink unsafe access to memory could be a problem in those
>> scenarios. Am I wrong?
>>
>> Any insight or debugging technique is  greatly appreciated.
>> Best,
>> Flavio
>>
>>
>> On Wed, May 24, 2017 at 2:11 PM, Greg Hogan  wrote:
>>
>>> Hi Flavio,
>>>
>>> Flink handles interrupts so the only silent killer I am aware of is
>>> Linux's OOM killer. Are you seeing such a message in dmesg?
>>>
>>> Greg
>>>
>>> On Wed, May 24, 2017 at 3:18 AM, Flavio Pompermaier <
>>> pomperma...@okkam.it> wrote:
>>>
 Hi to all,
 I'd like to know whether memory swapping could cause a taskmanager
 crash.
 In my cluster of virtual machines I'm seeing this strange behavior in my
 Flink cluster: sometimes, if memory gets swapped, the taskmanager (on that
 machine) dies unexpectedly without any log about the error.

 Is that possible or not?

 Best,
 Flavio

>>>
>>>
>