Re: Additional project downloads

2016-08-25 Thread Greg Hogan
Also the metrics reporters.

The circumstances for this request are that I wanted to use the metrics
reporters for 1.1.1 and had to go looking on Maven Central (also had to
download dependencies, which may be an issue with packaging). I'm also
looking to update the Gelly documentation to walk through running the
examples and wanted to avoid pointing users to Maven Central.

This sounds like a nice compromise, to include the extra jars in the
download but off the default classpath. The distinction between lib and
libraries may be subtle and not as applicable to the other jars. Would
something like "opt" be too obtuse?
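
Extending the layout from Stephan's mail below, that would look roughly
like:

  - bin/
  - conf/
  - lib/ (core flink runtime and apis, on the default classpath)
  - opt/ (extra jars shipped with the distribution, off the classpath)
   +-gelly/
   +-CEP/
   +-...
  - examples/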

On Thu, Aug 25, 2016 at 1:55 PM, Stephan Ewen  wrote:

> The downloads would be just the components' jar files, or everything?
>
> At some point, someone suggested adding the jars of all libraries (gelly,
> ml, ...) and connectors into the download tarball:
>
>   - bin/
>   - conf/
>   - lib/ (core flink runtime and apis)
>   - libraries/
>+-gelly/
>+-CEP/
>+-...
>   - examples/
>
> Would that be interesting to people?
>
>
> On Wed, Aug 24, 2016 at 5:26 PM, Till Rohrmann 
> wrote:
>
> > I agree that it would be good to offer this kind of convenience download
> > links.
> >
> > On Wed, Aug 24, 2016 at 5:25 PM, Robert Metzger 
> > wrote:
> >
> > > Maybe we should put a link to Maven Central. We could parameterize the
> > link
> > > so that it always links to the current release linked on our downloads
> > > page.
> > >
> > > On Wed, Aug 24, 2016 at 5:04 PM, Greg Hogan 
> wrote:
> > >
> > > > Hi,
> > > >
> > > > Should Flink add-ons such as CEP, Gelly, ML, and the optional Metrics
> > > > Reporters be available from the download page? Is the alternative to
> > > direct
> > > > users to Maven Central?
> > > >
> > > > Greg
> > > >
> > >
> >
>


[jira] [Created] (FLINK-4502) Cassandra connector documentation has misleading consistency guarantees

2016-08-25 Thread Elias Levy (JIRA)
Elias Levy created FLINK-4502:
-

 Summary: Cassandra connector documentation has misleading 
consistency guarantees
 Key: FLINK-4502
 URL: https://issues.apache.org/jira/browse/FLINK-4502
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.1.0
Reporter: Elias Levy


The Cassandra connector documentation states that  "enableWriteAheadLog() is an 
optional method, that allows exactly-once processing for non-deterministic 
algorithms."  This claim appears to be false.

From what I gather, the write ahead log feature of the connector works as
follows:
- The sink is replaced with a stateful operator that writes incoming messages
to the state backend based on the checkpoint they belong to.
- When the operator is notified that a Flink checkpoint has been completed,
then for each checkpoint older than and including the committed one, it:
  * reads its messages from the state backend
  * writes them to Cassandra
  * records that it has committed them to Cassandra for the specific checkpoint
and operator instance
  * and erases them from the state backend.

This process attempts to avoid resubmitting queries to Cassandra that would 
otherwise occur when recovering a job from a checkpoint and having messages 
replayed.
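
For illustration, a minimal sketch of that commit path (hypothetical names,
modeled on the description above, not the actual GenericWriteAheadSink code):

{code:java}
void notifyCheckpointComplete(long completedCheckpointId) throws Exception {
    // pending checkpoints are iterated oldest first
    for (PendingCheckpoint pending : pendingCheckpoints) {
        if (pending.checkpointId <= completedCheckpointId) {
            Iterable<Row> values = readFromStateBackend(pending); // 1. read buffered messages
            boolean success = sendValues(values);                 // 2. write them to Cassandra
            if (success) {
                committer.commitCheckpoint(pending.checkpointId); // 3. record the commit
                removeFromStateBackend(pending);                  // 4. erase the buffer
            }
        }
    }
}
{code}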

Alas, this does not guarantee exactly-once semantics at the sink.  The writes
to Cassandra that occur when the operator is notified that a checkpoint is
completed are not atomic and they are potentially non-idempotent.  If the job
dies while writing to Cassandra, or before committing the checkpoint via the
committer, queries will be replayed when the job recovers.  Thus the
documentation appears to be incorrect in stating that this provides
exactly-once semantics.

There also seems to be an issue in GenericWriteAheadSink's
notifyOfCompletedCheckpoint which may result in incorrect output.  If
sendValues returns false because a write failed, instead of bailing it simply
moves on to the next checkpoint to commit, if there is one, keeping the
previous one around to try again later.  But that can result in newer data
being overwritten with older data when the previous checkpoint is retried.
Worse, given that CassandraCommitter implements isCheckpointCommitted as
checkpointID <= this.lastCommittedCheckpointID, when the sink goes back to
retry the uncommitted older checkpoint it will consider it already committed,
even though some of its data may not have been written out, and that data will
be discarded.
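
In code form, the committer check described above amounts to (paraphrased
from the description, not verbatim source):

{code:java}
boolean isCheckpointCommitted(long checkpointID) {
    return checkpointID <= lastCommittedCheckpointID;
}
// Consequence: once checkpoint N is recorded as committed, a still-pending
// checkpoint N-1 is also reported as committed, so its unwritten data is
// skipped and silently discarded on retry.
{code}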





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4501) Cassandra sink can lose messages

2016-08-25 Thread Elias Levy (JIRA)
Elias Levy created FLINK-4501:
-

 Summary: Cassandra sink can lose messages
 Key: FLINK-4501
 URL: https://issues.apache.org/jira/browse/FLINK-4501
 Project: Flink
  Issue Type: Bug
  Components: Cassandra Connector
Affects Versions: 1.1.0
Reporter: Elias Levy


The problem is the same as I pointed out with the Kafka producer sink 
(FLINK-4027).  The CassandraTupleSink's send() and CassandraPojoSink's send() 
both send data asynchronously to Cassandra and record whether an error occurs 
via a future callback.  But CassandraSinkBase does not implement Checkpointed, 
so it can't stop a checkpoint from completing even though there are Cassandra
queries in flight from the checkpoint that may fail.  If they do fail, they
would subsequently not be replayed when the job recovered, and would thus be 
lost.

In addition, CassandraSinkBase's close() should check whether there is a
pending exception and throw it, rather than closing silently.  It should also
wait for any pending async queries to complete and check their status before
closing.
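
A minimal sketch of what such a close() could look like (illustrative field
names, not the actual CassandraSinkBase code):

{code:java}
// Assumed fields: an AtomicInteger counting in-flight async writes and an
// AtomicReference<Throwable> holding the first failure seen by a callback.
@Override
public void close() throws Exception {
    // wait for all pending async queries to complete
    while (pendingQueries.get() > 0) {
        Thread.sleep(10);
    }
    // surface any failure recorded by a callback instead of closing silently
    Throwable error = lastError.get();
    if (error != null) {
        throw new IOException("Error while writing to Cassandra", error);
    }
    session.close();
    cluster.close();
}
{code}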



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4500) Cassandra sink can lose messages

2016-08-25 Thread Elias Levy (JIRA)
Elias Levy created FLINK-4500:
-

 Summary: Cassandra sink can lose messages
 Key: FLINK-4500
 URL: https://issues.apache.org/jira/browse/FLINK-4500
 Project: Flink
  Issue Type: Bug
  Components: Cassandra Connector
Affects Versions: 1.1.0
Reporter: Elias Levy


The problem is the same as I pointed out with the Kafka producer sink 
(FLINK-4027).  The CassandraTupleSink's send() and CassandraPojoSink's send() 
both send data asynchronously to Cassandra and record whether an error occurs 
via a future callback.  But CassandraSinkBase does not implement Checkpointed, 
so it can't stop a checkpoint from completing even though there are Cassandra
queries in flight from the checkpoint that may fail.  If they do fail, they
would subsequently not be replayed when the job recovered, and would thus be 
lost.

In addition, CassandraSinkBase's close() should check whether there is a
pending exception and throw it, rather than closing silently.  It should also
wait for any pending async queries to complete and check their status before
closing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Task manager processes crashing one after the other

2016-08-25 Thread Gyula Fóra
Stephan,

I ported the fix for the concurrency issue from the Flink commit so now
that should be fine. I ran some fail/restore tests and that specific issue
hasn't appeared again.

However, I now get many segfaults in the initializeForJob method where the
RocksDB instance is opened. Just for the record, this is the exact same code
as we have in Flink now:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f12b018f51f, pid=12576, tid=139668190197504
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build
1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode
linux-amd64 )
# Problematic frame:
# C  [libc.so.6+0x7b51f]
...
Stack: [0x7f0708ccf000,0x7f0708dd],  sp=0x7f0708dccd20,
 free space=1015k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native
code)
C  [libc.so.6+0x7b51f]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  org.rocksdb.RocksDB.open(JLjava/lang/String;Ljava/util/List;I)Ljava/util/List;+0
j  org.rocksdb.RocksDB.open(Lorg/rocksdb/DBOptions;Ljava/lang/String;Ljava/util/List;Ljava/util/List;)Lorg/rocksdb/RocksDB;+23
j  com.king.rbea.backend.state.rocksdb.RocksDBStateBackend.initializeForJob...

And this happens fairly frequently when the jobs are restarting after
failure.

Cheers,
Gyula

Gyula Fóra  wrote (on Aug 25, 2016, Thu, 19:07):

> Yes, seems like that; I remember the fix in Flink. I apparently made a
> mistake somewhere in our code :)
>
> Thanks,
> Gyula
>
> On Thu, Aug 25, 2016, 18:59 Stephan Ewen  wrote:
>
>> We saw some crashes in earlier versions when native handles in RocksDB
>> (even for config option objects) were manually and too eagerly released.
>>
>> Maybe you have a similar issue here?
>>
>> On Thu, Aug 25, 2016 at 6:27 PM, Gyula Fóra  wrote:
>>
>> > Hi,
>> > This seems to be a sneaky concurrency issue in our custom state backend
>> > implementation.
>> >
>> > I made some changes, will keep you posted.
>> >
>> > Cheers,
>> > Gyula
>> >
>> > On Thu, Aug 25, 2016, 10:54 Gyula Fóra  wrote:
>> >
>> > > Hi,
>> > >
>> > > Sure I am sending the TM logs in priv.
>> > >
>> > > Currently what I did was to bump the Rocks version to 4.9.0; let's see
>> > > if that helps.
>> > >
>> > > Cheers,
>> > > Gyula
>> > >
>> > > Till Rohrmann  wrote (on Aug 25, 2016, Thu, 10:35):
>> > >
>> > >> Hi Gyula,
>> > >>
>> > >> I haven't seen this problem before. Do you have the logs of the
>> failed
>> > TMs
>> > >> so that we have some more context what was going on?
>> > >>
>> > >> Cheers,
>> > >> Till
>> > >>
>> > >> On Thu, Aug 25, 2016 at 9:40 AM, Gyula Fóra 
>> wrote:
>> > >>
>> > >> > Hi guys,
>> > >> >
>> > >> > For quite some time now we fairly frequently experience task
>> manager
>> > >> > crashes around the time new streaming jobs are deployed. We use
>> > RocksDB
>> > >> > backend so this might be related.
>> > >> >
>> > >> > We tried changing the GC from G1 to CMS, but that didn't help.
>> > >> >
>> > >> > Yesterday for instance 6 task managers crashed one after the other
>> > with
>> > >> > similar errors:
>> > >> >
>> > >> > *** Error in `java': double free or corruption (!prev):
>> > >> 0x7fac0414d760
>> > >> > ***
>> > >> > *** Error in `java': free(): invalid pointer: 0x7f8dcc0026c0
>> ***
>> > >> > *** Error in `java': double free or corruption (!prev):
>> > >> 0x7f15247f9a90
>> > >> > ***
>> > >> > ...
>> > >> >
>> > >> > Does anyone have any clue what might cause this or how to debug?
>> > >> > This is a very critical issue :(
>> > >> >
>> > >> > Cheers,
>> > >> > Gyula
>> > >> >
>> > >>
>> > >
>> >
>>
>


Fwd: Enabling Encryption between slaves in Flink

2016-08-25 Thread Vinay Patil
Hi,

I have a requirement that all the data flowing between the task managers
should be encrypted. Is there a way in Flink to do that?

Can we use the configuration file to enable this, as follows:
http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#Remoting_Sample

or do we need to add the above configurations in code here:
https://github.com/apache/flink/blob/master/flink-runtime/src/main/scala/org/apache/flink/runtime/akka/AkkaUtils.scala

I have looked at this mail thread, but wanted to get a clear understanding
of how we can achieve it:
http://mail-archives.apache.org/mod_mbox/flink-user/201601.mbox/%3CCANC1h_v0pqfvfto478ft5cbgm-bf-do742gouz528bw7vrj...@mail.gmail.com%3E
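
For reference, the kind of settings the Akka remoting page above describes
looks roughly like this (standard Akka remoting options with placeholder
paths and passwords; whether Flink's AkkaUtils passes them through is
exactly what I am unsure about):

    akka.remote {
      enabled-transports = ["akka.remote.netty.ssl"]
      netty.ssl.security {
        key-store = "/path/to/keystore"
        key-store-password = "changeme"
        key-password = "changeme"
        trust-store = "/path/to/truststore"
        trust-store-password = "changeme"
        protocol = "TLSv1"
        enabled-algorithms = ["TLS_RSA_WITH_AES_128_CBC_SHA"]
      }
    }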



Regards,
Vinay Patil


Re: Additional project downloads

2016-08-25 Thread Stephan Ewen
The downloads would be just the components' jar files, or everything?

At some point, someone suggested adding the jars of all libraries (gelly,
ml, ...) and connectors into the download tarball:

  - bin/
  - conf/
  - lib/ (core flink runtime and apis)
  - libraries/
   +-gelly/
   +-CEP/
   +-...
  - examples/

Would that be interesting to people?


On Wed, Aug 24, 2016 at 5:26 PM, Till Rohrmann  wrote:

> I agree that it would be good to offer this kind of convenience download
> links.
>
> On Wed, Aug 24, 2016 at 5:25 PM, Robert Metzger 
> wrote:
>
> > Maybe we should put a link to Maven Central. We could parameterize the
> link
> > so that it always links to the current release linked on our downloads
> > page.
> >
> > On Wed, Aug 24, 2016 at 5:04 PM, Greg Hogan  wrote:
> >
> > > Hi,
> > >
> > > Should Flink add-ons such as CEP, Gelly, ML, and the optional Metrics
> > > Reporters be available from the download page? Is the alternative to
> > direct
> > > users to Maven Central?
> > >
> > > Greg
> > >
> >
>


[jira] [Created] (FLINK-4499) Introduce findbugs maven plugin

2016-08-25 Thread Ted Yu (JIRA)
Ted Yu created FLINK-4499:
-

 Summary: Introduce findbugs maven plugin
 Key: FLINK-4499
 URL: https://issues.apache.org/jira/browse/FLINK-4499
 Project: Flink
  Issue Type: Improvement
Reporter: Ted Yu


As suggested by Stephan in FLINK-4482, this issue is to add the
findbugs-maven-plugin to the build process so that we can detect lack of
proper locking and other defects automatically.
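
A minimal sketch of the plugin entry (the groupId/artifactId are the plugin's
published coordinates; the version and phase shown are just an example):

{code:xml}
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>findbugs-maven-plugin</artifactId>
  <version>3.0.4</version>
  <executions>
    <execution>
      <phase>verify</phase>
      <goals>
        <goal>check</goal>
      </goals>
    </execution>
  </executions>
</plugin>
{code}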



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Task manager processes crashing one after the other

2016-08-25 Thread Gyula Fóra
Yes, seems like that; I remember the fix in Flink. I apparently made a
mistake somewhere in our code :)

Thanks,
Gyula

On Thu, Aug 25, 2016, 18:59 Stephan Ewen  wrote:

> We saw some crashes in earlier versions when native handles in RocksDB
> (even for config option objects) were manually and too eagerly released.
>
> Maybe you have a similar issue here?
>
> On Thu, Aug 25, 2016 at 6:27 PM, Gyula Fóra  wrote:
>
> > Hi,
> > This seems to be a sneaky concurrency issue in our custom state backend
> > implementation.
> >
> > I made some changes, will keep you posted.
> >
> > Cheers,
> > Gyula
> >
> > On Thu, Aug 25, 2016, 10:54 Gyula Fóra  wrote:
> >
> > > Hi,
> > >
> > > Sure I am sending the TM logs in priv.
> > >
> > > Currently what I did was to bump the Rocks version to 4.9.0; let's see
> > > if that helps.
> > >
> > > Cheers,
> > > Gyula
> > >
> > > Till Rohrmann  wrote (on Aug 25, 2016, Thu, 10:35):
> > >
> > >> Hi Gyula,
> > >>
> > >> I haven't seen this problem before. Do you have the logs of the failed
> > TMs
> > >> so that we have some more context what was going on?
> > >>
> > >> Cheers,
> > >> Till
> > >>
> > >> On Thu, Aug 25, 2016 at 9:40 AM, Gyula Fóra 
> wrote:
> > >>
> > >> > Hi guys,
> > >> >
> > >> > For quite some time now we fairly frequently experience task
> manager
> > >> > crashes around the time new streaming jobs are deployed. We use
> > RocksDB
> > >> > backend so this might be related.
> > >> >
> > >> > We tried changing the GC from G1 to CMS, but that didn't help.
> > >> >
> > >> > Yesterday for instance 6 task managers crashed one after the other
> > with
> > >> > similar errors:
> > >> >
> > >> > *** Error in `java': double free or corruption (!prev):
> > >> 0x7fac0414d760
> > >> > ***
> > >> > *** Error in `java': free(): invalid pointer: 0x7f8dcc0026c0 ***
> > >> > *** Error in `java': double free or corruption (!prev):
> > >> 0x7f15247f9a90
> > >> > ***
> > >> > ...
> > >> >
> > >> > Does anyone have any clue what might cause this or how to debug?
> > >> > This is a very critical issue :(
> > >> >
> > >> > Cheers,
> > >> > Gyula
> > >> >
> > >>
> > >
> >
>


Re: Task manager processes crashing one after the other

2016-08-25 Thread Stephan Ewen
We saw some crashes in earlier versions when native handles in RocksDB
(even for config option objects) were manually and too eagerly released.

Maybe you have a similar issue here?
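
To illustrate the pattern (a hedged example of that bug class, not code from
Flink or from the custom backend):

    List<ColumnFamilyHandle> handles = new ArrayList<>();
    DBOptions options = new DBOptions().setCreateIfMissing(true);
    RocksDB db = RocksDB.open(options, path, descriptors, handles);
    options.dispose(); // too eager: frees the native handle while 'db' may
                       // still reference it; later use can then crash with
                       // SIGSEGV or a double free in libc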

On Thu, Aug 25, 2016 at 6:27 PM, Gyula Fóra  wrote:

> Hi,
> This seems to be a sneaky concurrency issue in our custom state backend
> implementation.
>
> I made some changes, will keep you posted.
>
> Cheers,
> Gyula
>
> On Thu, Aug 25, 2016, 10:54 Gyula Fóra  wrote:
>
> > Hi,
> >
> > Sure I am sending the TM logs in priv.
> >
> > Currently what I did was to bump the Rocks version to 4.9.0; let's see if
> > that helps.
> >
> > Cheers,
> > Gyula
> >
> > Till Rohrmann  wrote (on Aug 25, 2016, Thu, 10:35):
> >
> >> Hi Gyula,
> >>
> >> I haven't seen this problem before. Do you have the logs of the failed
> TMs
> >> so that we have some more context what was going on?
> >>
> >> Cheers,
> >> Till
> >>
> >> On Thu, Aug 25, 2016 at 9:40 AM, Gyula Fóra  wrote:
> >>
> >> > Hi guys,
> >> >
> >> > For quite some time now we fairly frequently experience task manager
> >> > crashes around the time new streaming jobs are deployed. We use
> RocksDB
> >> > backend so this might be related.
> >> >
> >> > We tried changing the GC from G1 to CMS, but that didn't help.
> >> >
> >> > Yesterday for instance 6 task managers crashed one after the other
> with
> >> > similar errors:
> >> >
> >> > *** Error in `java': double free or corruption (!prev):
> >> 0x7fac0414d760
> >> > ***
> >> > *** Error in `java': free(): invalid pointer: 0x7f8dcc0026c0 ***
> >> > *** Error in `java': double free or corruption (!prev):
> >> 0x7f15247f9a90
> >> > ***
> >> > ...
> >> >
> >> > Does anyone have any clue what might cause this or how to debug?
> >> > This is a very critical issue :(
> >> >
> >> > Cheers,
> >> > Gyula
> >> >
> >>
> >
>


[jira] [Created] (FLINK-4498) Better Cassandra sink documentation

2016-08-25 Thread Elias Levy (JIRA)
Elias Levy created FLINK-4498:
-

 Summary: Better Cassandra sink documentation
 Key: FLINK-4498
 URL: https://issues.apache.org/jira/browse/FLINK-4498
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.1.0
Reporter: Elias Levy


The Cassandra sink documentation is somewhat muddled and could be improved.
For instance, the fact that it only supports tuples and POJOs that use
DataStax Mapper annotations is only mentioned in passing, and it is not clear
that the reference to tuples only applies to Flink Java tuples and not Scala
tuples.

The documentation also does not mention that setQuery() is only necessary for 
tuple streams.  It would be good to have an example of a POJO stream with the 
DataStax annotations.
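
Something along these lines, for instance (hypothetical keyspace, table, and
field names; annotations from the DataStax Java driver's object mapper):

{code:java}
import com.datastax.driver.mapping.annotations.Column;
import com.datastax.driver.mapping.annotations.Table;

@Table(keyspace = "example", name = "events")
public class Event {

    @Column(name = "id")
    private String id;

    // blob columns must be java.nio.ByteBuffer, not byte[] (see below)
    @Column(name = "payload")
    private java.nio.ByteBuffer payload;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public java.nio.ByteBuffer getPayload() { return payload; }
    public void setPayload(java.nio.ByteBuffer payload) { this.payload = payload; }
}
{code}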

The explanation of the write ahead log could use some cleaning up to clarify 
when it is appropriate to use, ideally with an example.  Maybe this would be 
best as a blog post to expand on the type of non-deterministic streams this 
applies to.

It would also be useful to mention that tuple elements will be mapped to
Cassandra columns using the DataStax Java driver's default encoders, which are
somewhat limited (e.g. to write to a blob column the type in the tuple must be
a java.nio.ByteBuffer and not just a byte[]).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Task manager processes crashing one after the other

2016-08-25 Thread Gyula Fóra
Hi,
This seems to be a sneaky concurrency issue in our custom state backend
implementation.

I made some changes, will keep you posted.

Cheers,
Gyula

On Thu, Aug 25, 2016, 10:54 Gyula Fóra  wrote:

> Hi,
>
> Sure I am sending the TM logs in priv.
>
> Currently what I did was to bump the Rocks version to 4.9.0; let's see if
> that helps.
>
> Cheers,
> Gyula
>
> Till Rohrmann  wrote (on Aug 25, 2016, Thu, 10:35):
>
>> Hi Gyula,
>>
>> I haven't seen this problem before. Do you have the logs of the failed TMs
>> so that we have some more context what was going on?
>>
>> Cheers,
>> Till
>>
>> On Thu, Aug 25, 2016 at 9:40 AM, Gyula Fóra  wrote:
>>
>> > Hi guys,
>> >
>> > For quite some time now we fairly frequently experience task manager
>> > crashes around the time new streaming jobs are deployed. We use RocksDB
>> > backend so this might be related.
>> >
>> > We tried changing the GC from G1 to CMS, but that didn't help.
>> >
>> > Yesterday for instance 6 task managers crashed one after the other
>> > similar errors:
>> >
>> > *** Error in `java': double free or corruption (!prev):
>> 0x7fac0414d760
>> > ***
>> > *** Error in `java': free(): invalid pointer: 0x7f8dcc0026c0 ***
>> > *** Error in `java': double free or corruption (!prev):
>> 0x7f15247f9a90
>> > ***
>> > ...
>> >
>> > Does anyone have any clue what might cause this or how to debug?
>> > This is a very critical issue :(
>> >
>> > Cheers,
>> > Gyula
>> >
>>
>


[jira] [Created] (FLINK-4497) Add support for Scala tuples and case classes to Cassandra sink

2016-08-25 Thread Elias Levy (JIRA)
Elias Levy created FLINK-4497:
-

 Summary: Add support for Scala tuples and case classes to 
Cassandra sink
 Key: FLINK-4497
 URL: https://issues.apache.org/jira/browse/FLINK-4497
 Project: Flink
  Issue Type: Improvement
  Components: Cassandra Connector
Affects Versions: 1.1.0
Reporter: Elias Levy


The new Cassandra sink only supports streams of Flink Java tuples and Java
POJOs that have been annotated for use by the DataStax Mapper.  The sink should
be extended to support Scala types and case classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4496) Refactor the TimeServiceProvider to take a Triggerable instead of a Runnable.

2016-08-25 Thread Kostas Kloudas (JIRA)
Kostas Kloudas created FLINK-4496:
-

 Summary: Refactor the TimeServiceProvider to take a Triggerable
instead of a Runnable.
 Key: FLINK-4496
 URL: https://issues.apache.org/jira/browse/FLINK-4496
 Project: Flink
  Issue Type: Sub-task
Reporter: Kostas Kloudas
Assignee: Kostas Kloudas






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4495) Running multiple jobs on yarn (without yarn-session)

2016-08-25 Thread Niels Basjes (JIRA)
Niels Basjes created FLINK-4495:
---

 Summary: Running multiple jobs on yarn (without yarn-session)
 Key: FLINK-4495
 URL: https://issues.apache.org/jira/browse/FLINK-4495
 Project: Flink
  Issue Type: Improvement
  Components: YARN Client
Reporter: Niels Basjes


I created a small application that needs to run multiple (batch) jobs on Yarn 
and then terminate.

Right now I essentially do the following:

flink run -m yarn-cluster -yn 10  bla.jar ...

And in my main I do:

for (Thing thing : thingsToDo) {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    // ... define the batch job ...
    env.execute();
}

In the second job I submit I get an exception:
{code}
java.lang.RuntimeException: Unable to tell application master to stop once the 
specified job has been finised
at 
org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:184)
at 
org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:202)
at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:389)
at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:376)
at 
org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:61)
at com.bol.tools.hbase.export.Main.runSingleTable(Main.java:220)
at com.bol.tools.hbase.export.Main.run(Main.java:81)
at com.bol.tools.hbase.export.Main.main(Main.java:42)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:509)
at 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:403)
at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:331)
at 
org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:775)
at org.apache.flink.client.CliFrontend.run(CliFrontend.java:251)
at org.apache.flink.client.CliFrontend$2.run(CliFrontend.java:995)
at org.apache.flink.client.CliFrontend$2.run(CliFrontend.java:992)
at 
org.apache.flink.runtime.security.SecurityUtils$1.run(SecurityUtils.java:56)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at 
org.apache.flink.runtime.security.SecurityUtils.runSecured(SecurityUtils.java:53)
at 
org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:992)
at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1046)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
[1 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at scala.concurrent.Await.result(package.scala)
at 
org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:182)
... 25 more

{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4494) Expose the TimeServiceProvider from the Task to each Operator.

2016-08-25 Thread Kostas Kloudas (JIRA)
Kostas Kloudas created FLINK-4494:
-

 Summary: Expose the TimeServiceProvider from the Task to each 
Operator.
 Key: FLINK-4494
 URL: https://issues.apache.org/jira/browse/FLINK-4494
 Project: Flink
  Issue Type: Bug
Reporter: Kostas Kloudas


This change aims at simplifying the {{StreamTask}} class by directly exposing
the {{TimeServiceProvider}} to the operators being executed.

This implies removing the {{registerTimer()}} and
{{getCurrentProcessingTime()}} methods from the {{StreamTask}}. Now, to
register a timer and query the time, each operator will be able to get the
{{TimeServiceProvider}} and call the corresponding methods directly on it.
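
For operator code, the call pattern would then look roughly like this
(sketch; exact signatures may differ):

{code:java}
TimeServiceProvider timeService = getTimeServiceProvider(); // accessor name illustrative
long now = timeService.getCurrentProcessingTime();
timeService.registerTimer(now + 1000, new Runnable() {
    @Override
    public void run() {
        // fire the timer callback
    }
});
{code}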

In addition, this will simplify many of the tests which now implement their own 
time providers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4493) Unify the snapshot output format for keyed-state backends

2016-08-25 Thread Stefan Richter (JIRA)
Stefan Richter created FLINK-4493:
-

 Summary: Unify the snapshot output format for keyed-state backends
 Key: FLINK-4493
 URL: https://issues.apache.org/jira/browse/FLINK-4493
 Project: Flink
  Issue Type: Improvement
Reporter: Stefan Richter
Priority: Minor


We could unify the snapshot output format across keyed-state backend
implementations, e.g. those based on RocksDB and Heap, so that all write a
single, common format.

For example, this would allow us to restore state that was previously kept in
RocksDB on a heap-based backend, and vice versa.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4490) Decouple Slot and Instance

2016-08-25 Thread Kurt Young (JIRA)
Kurt Young created FLINK-4490:
-

 Summary: Decouple Slot and Instance
 Key: FLINK-4490
 URL: https://issues.apache.org/jira/browse/FLINK-4490
 Project: Flink
  Issue Type: Sub-task
Reporter: Kurt Young


Currently, {{Slot}} and {{Instance}} hold references to each other. For
{{Instance}} holding {{Slot}}, this makes sense because it reflects how many
resources the instance can provide and how many are in use.

But it's not really necessary for {{Slot}} to hold the {{Instance}} it belongs
to. It only needs to hold some connection information and a gateway to talk
to. Another downside of {{Slot}} holding {{Instance}} is that {{Instance}}
actually contains some allocation/de-allocation logic, so it will be difficult
to refactor the allocation logic without {{Slot}} noticing.

We should abstract the connection information of {{Instance}} for {{Slot}} to
hold. (Actually we have {{InstanceConnectionInfo}} now, but it lacks the
instance's Akka gateway; maybe we can just add the Akka gateway to the
{{InstanceConnectionInfo}}.)
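
As a sketch, the slot would then only depend on something like (names
illustrative, not actual Flink code):

{code:java}
public interface TaskManagerConnection {
    InstanceConnectionInfo getConnectionInfo();
    ActorGateway getAkkaGateway(); // the piece InstanceConnectionInfo lacks today
}
{code}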



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4491) Handle index.number_of_shards in the ES connector

2016-08-25 Thread Flavio Pompermaier (JIRA)
Flavio Pompermaier created FLINK-4491:
-

 Summary: Handle index.number_of_shards in the ES connector
 Key: FLINK-4491
 URL: https://issues.apache.org/jira/browse/FLINK-4491
 Project: Flink
  Issue Type: Improvement
  Components: Streaming Connectors
Affects Versions: 1.1.0
Reporter: Flavio Pompermaier
Priority: Minor


At the moment it is not possible to configure the number of shards if an
index does not already exist on the Elasticsearch cluster. It would be a great
improvement to handle index.number_of_shards (passed in the configuration
object). E.g.:

{code:java}
Map<String, String> config = Maps.newHashMap();
config.put("bulk.flush.max.actions", "1");
config.put("cluster.name", "my-cluster-name");
config.put("index.number_of_shards", "1");
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4492) Cleanup files from canceled snapshots

2016-08-25 Thread Stefan Richter (JIRA)
Stefan Richter created FLINK-4492:
-

 Summary: Cleanup files from canceled snapshots
 Key: FLINK-4492
 URL: https://issues.apache.org/jira/browse/FLINK-4492
 Project: Flink
  Issue Type: Bug
Reporter: Stefan Richter
Priority: Minor


Currently, checkpointing only closes CheckpointStateOutputStreams on cancel;
incomplete files are not properly deleted from the filesystem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4489) Implement TaskManager's SlotManager

2016-08-25 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-4489:


 Summary: Implement TaskManager's SlotManager
 Key: FLINK-4489
 URL: https://issues.apache.org/jira/browse/FLINK-4489
 Project: Flink
  Issue Type: Sub-task
Reporter: Till Rohrmann


The {{SlotManager}} is responsible for managing the available slots on the
{{TaskManager}}. This basically means maintaining the mapping between slots
and the owning {{JobManagers}}, and offering tasks which run in the slots
access to their owning {{JobManagers}}.
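
A rough sketch of the bookkeeping this implies (illustrative types, not a
design commitment):

{code:java}
public class SlotManager {
    // which JobManager owns which slot
    private final Map<SlotID, JobID> slotOwners = new HashMap<>();
    // gateway per owning JobManager, handed to the tasks running in its slots
    private final Map<JobID, ActorGateway> jobManagerGateways = new HashMap<>();
}
{code}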



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4488) Prevent cluster shutdown after job execution for non-detached jobs

2016-08-25 Thread Maximilian Michels (JIRA)
Maximilian Michels created FLINK-4488:
-

 Summary: Prevent cluster shutdown after job execution for 
non-detached jobs
 Key: FLINK-4488
 URL: https://issues.apache.org/jira/browse/FLINK-4488
 Project: Flink
  Issue Type: Bug
  Components: YARN Client
Affects Versions: 1.2.0, 1.1.1
Reporter: Maximilian Michels
Assignee: Maximilian Michels
Priority: Minor
 Fix For: 1.2.0, 1.1.2


In per-job mode, the Yarn cluster currently shuts down after the first
interactively executed job. Users may want to execute multiple jobs in one
jar. I would suggest using this mechanism only for jobs which run detached.
For interactive jobs, shutdown of the cluster is additionally handled by the
CLI, which should be sufficient to ensure cluster shutdown. Cluster shutdown
could only become a problem in case of a network partition to the cluster or
an outage of the CLI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4487) Need tools for managing the yarn-session better

2016-08-25 Thread Niels Basjes (JIRA)
Niels Basjes created FLINK-4487:
---

 Summary: Need tools for managing the yarn-session better
 Key: FLINK-4487
 URL: https://issues.apache.org/jira/browse/FLINK-4487
 Project: Flink
  Issue Type: Improvement
  Components: YARN Client
Affects Versions: 1.1.0
Reporter: Niels Basjes


There is a script, yarn-session.sh, which starts an empty JobManager on a
Yarn cluster.
Desired improvements:
# If there is already a yarn-session running then yarn-session does not start
a new one (or it kills the old one?). Note that the file with ip/port may
exist yet the corresponding JobManager may have been killed in another way.
# A script that effectively lets me stop a yarn session and clean up the file
that contains the ip/port of this yarn session and the .flink directory on
HDFS.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4486) JobManager not fully running when yarn-session.sh finishes

2016-08-25 Thread Niels Basjes (JIRA)
Niels Basjes created FLINK-4486:
---

 Summary: JobManager not fully running when yarn-session.sh finishes
 Key: FLINK-4486
 URL: https://issues.apache.org/jira/browse/FLINK-4486
 Project: Flink
  Issue Type: Bug
  Components: YARN Client
Affects Versions: 1.1.0
Reporter: Niels Basjes


I start a detached yarn-session.sh.
If the Yarn cluster is very busy then the yarn-session.sh script completes
BEFORE all the task slots have been allocated. As a consequence, I sometimes
have a jobmanager without any task slots. Over time these task slots are
assigned by the Yarn cluster, but they are not available for the first job
that is submitted.

As a result, I have found that the first few tasks in my job fail with the
error "Not enough free slots available to run the job.".

I think the desirable behavior is that yarn-session waits until the jobmanager 
is fully functional and capable of actually running the jobs.

{code}
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not 
enough free slots available to run the job. You can decrease the operator 
parallelism or increase the number of slots per TaskManager in the 
configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (Read prefix 
'4') -> Map (Map prefix '4') (8/10)) @ (unassigned) - [SCHEDULED] > with 
groupID < cd6c37df290564e603da908a8783a9bf > in sharing group < 
SlotSharingGroup [c0b6eff6ce93967182cdb6dfeae9359b, 
8b2c3b39f3a55adf9f123243ab03c9c1, 55fb94dd8a3e5f59a10dbbf5c4925db4, 
433b2e4a05a5e685b48c517249755a89, 8c74690c35454064e4815ac3756cdca2, 
4b4fbd24f3483030fd852b38ff2249c1, 5e36a56ea4dece18fe5ba04352d90dc8, 
cd6c37df290564e603da908a8783a9bf, 64eafa845087bee70735f7250df9994f, 
706a5d6fe48ae57724a00a9fce5dae8a, 7bee4297e0e839e53a153dfcbcca8624, 
21b58f7d408d237540ae7b4734f81a1d, b429b1ff338d9d73677f42717cfc0dbc, 
cc7491db641f557c6aa8c749ebc2de62, f61cbf0ae00331f67aaf60ace78b05aa, 
606f02ea9e0f4ad57f0cc0232dd70842] >. Resources available to scheduler: Number 
of instances=1, total number of slots=7, available slots=0
at 
org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
at 
org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
at 
org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:306)
at 
org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:454)
at 
org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:326)
at 
org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:734)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:1332)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1291)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1291)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}
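
A hedged sketch of the kind of wait I have in mind (illustrative API; I do
not know the right place in the yarn-session code):

{code:java}
// Poll until the JobManager reports the requested number of slots, or give up.
long deadline = System.currentTimeMillis() + 5 * 60 * 1000;
while (getNumberOfAvailableSlots() < requestedTaskSlots) {
    if (System.currentTimeMillis() > deadline) {
        throw new RuntimeException("JobManager did not acquire the requested slots in time");
    }
    Thread.sleep(500);
}
{code}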



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4485) Finished jobs in yarn session fill /tmp filesystem

2016-08-25 Thread Niels Basjes (JIRA)
Niels Basjes created FLINK-4485:
---

 Summary: Finished jobs in yarn session fill /tmp filesystem
 Key: FLINK-4485
 URL: https://issues.apache.org/jira/browse/FLINK-4485
 Project: Flink
  Issue Type: Bug
  Components: JobManager
Affects Versions: 1.1.0
Reporter: Niels Basjes
Priority: Blocker


On a Yarn cluster I start a yarn-session with a few containers and task slots.
Then I fire a 'large' number of Flink batch jobs in sequence against this yarn 
session. It is the exact same job (java code) yet it gets different parameters.

In this scenario it is exporting HBase tables to files in HDFS, and the
parameters specify which data to read from which tables and the name of the
target directory.

After running several dozen jobs, job submission started to fail and we
investigated.

We found that the cause was that on the Yarn node which was hosting the 
jobmanager the /tmp file system was full (4GB was 100% full).

However, the output of {{du -hcs /tmp}} showed only 200MB in use.

We found that a very large file (we guess it is the jar of the job) was put in
/tmp, used, and deleted, yet the file handle was not closed by the jobmanager.

As soon as we killed the jobmanager the disk space was freed.
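
For illustration, the classic pattern that produces such entries
(hypothetical code, not traced to the actual jobmanager):

{code:java}
import java.io.*;

public class DeletedButOpen {
    public static void main(String[] args) throws Exception {
        File temp = new File("/tmp/incoming-temp-0003"); // made-up path
        try (FileOutputStream out = new FileOutputStream(temp)) {
            out.write(new byte[66219695]); // the size seen in lsof below
        }
        InputStream in = new FileInputStream(temp); // open a read handle
        temp.delete(); // directory entry gone: du(1) no longer counts it
        // As long as 'in' stays open, the kernel keeps the blocks allocated,
        // so /tmp stays full while looking nearly empty; lsof shows the file
        // as "(deleted)", exactly as in the output below.
        System.in.read(); // park the process so the effect can be observed
        in.close();
    }
}
{code}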

See parts of the output we got from {{lsof}} below.

{code}
COMMAND PID  USER   FD  TYPE DEVICE  SIZE  NODE NAME
java  15034   nbasjes  550r  REG 253,17  66219695
245 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0003 
(deleted)
java  15034   nbasjes  551r  REG 253,17  66219695
252 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0007 
(deleted)
java  15034   nbasjes  552r  REG 253,17  66219695
267 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0012 
(deleted)
java  15034   nbasjes  553r  REG 253,17  66219695
250 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0005 
(deleted)
java  15034   nbasjes  554r  REG 253,17  66219695
288 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0018 
(deleted)
java  15034   nbasjes  555r  REG 253,17  66219695
298 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0025 
(deleted)
java  15034   nbasjes  557r  REG 253,17  66219695
254 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0008 
(deleted)
java  15034   nbasjes  558r  REG 253,17  66219695
292 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0019 
(deleted)
java  15034   nbasjes  559r  REG 253,17  66219695
275 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0013 
(deleted)
java  15034   nbasjes  560r  REG 253,17  66219695
159 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0002 
(deleted)
java  15034   nbasjes  562r  REG 253,17  66219695
238 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0001 
(deleted)
java  15034   nbasjes  568r  REG 253,17  66219695
246 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0004 
(deleted)
java  15034   nbasjes  569r  REG 253,17  66219695
255 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0009 
(deleted)
java  15034   nbasjes  571r  REG 253,17  66219695
299 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0026 
(deleted)
java  15034   nbasjes  572r  REG 253,17  66219695
293 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0020 
(deleted)
java  15034   nbasjes  574r  REG 253,17  66219695
256 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0010 
(deleted)
java  15034   nbasjes  575r  REG 253,17  66219695
302 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0029 
(deleted)
java  15034   nbasjes  576r  REG 253,17  66219695
294 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0021 
(deleted)
java  15034   nbasjes  577r  REG 253,17  66219695
262 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0011 
(deleted)
java  15034   nbasjes  578r  REG 253,17  66219695
251 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0006 
(deleted)
java  15034   nbasjes  580r  REG 253,17  66219695
295 
{code}

[jira] [Created] (FLINK-4484) FLIP-10: Unify Savepoints and Checkpoints

2016-08-25 Thread Ufuk Celebi (JIRA)
Ufuk Celebi created FLINK-4484:
--

 Summary: FLIP-10: Unify Savepoints and Checkpoints
 Key: FLINK-4484
 URL: https://issues.apache.org/jira/browse/FLINK-4484
 Project: Flink
  Issue Type: Improvement
Reporter: Ufuk Celebi


Super issue to track progress for 
[FLIP-10|https://cwiki.apache.org/confluence/display/FLINK/FLIP-10%3A+Unify+Checkpoints+and+Savepoints].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] Python API for Fllink libraries

2016-08-25 Thread Chesnay Schepler

Hello Ivan,

I don't know why it is the way it is.

Regarding issues to work on: You should be able to go through the 
transformations documentation and see which are not supported.


Regards,
Chesnay

On 21.08.2016 01:11, Ivan Mushketyk wrote:

Hi Chesnay,

Thank you for your reply.
Out of curiosity, do you know why the Python API reception was *tumbleweed*?

Regarding the Python API, do you know what specifically should be done
there? I have some Python background so I was considering contributing,
but I didn't find many tasks in the "Python" component:
https://issues.apache.org/jira/browse/FLINK-1926?jql=project%20%3D%20FLINK%20AND%20component%20%3D%20%22Python%20API%22%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

Best regards,
Ivan.


On Fri, 19 Aug 2016 at 22:45 Chesnay Schepler  wrote:


Hello,

I would say no, as the general reception of the Python API has been
*tumbleweed* so far.

In my opinion this would just lead to a massive increase in code to
maintain; we would need at least 2-3 active long-term Python contributors.
Especially so since ML, CEP and Table are afaik still in heavy development.

If anything, before thinking about porting the libraries to Python it
would make more sense to implement a Python streaming API.
Or maybe /finish/ porting the DataSet API...

Regards,
Chesnay

On 19.08.2016 22:07, Ivan Mushketyk wrote:

Hi Flink developers,

It seems to me that Flink has two important "selling points":

1. It has Java, Scala and Python APIs
2. It has a number of useful libraries (ML, Gelly, CEP, and Table)

But as far as I understand, currently users cannot use any of these
libraries using a Python API. It seems to be a gap worth filling.

What do you think about it? Does it make sense to add CEP/Gelly/ML/Table
Python APIs?

Best regards,
Ivan.







Re: Task manager processes crashing one after the other

2016-08-25 Thread Gyula Fóra
Hi,

Sure I am sending the TM logs in priv.

Currently what I did was to bump the Rocks version to 4.9.0; let's see if
that helps.

Cheers,
Gyula

Till Rohrmann  wrote (on Aug 25, 2016, Thu, 10:35):

> Hi Gyula,
>
> I haven't seen this problem before. Do you have the logs of the failed TMs
> so that we have some more context what was going on?
>
> Cheers,
> Till
>
> On Thu, Aug 25, 2016 at 9:40 AM, Gyula Fóra  wrote:
>
> > Hi guys,
> >
> > For quite some time now we fairly frequently experience task manager
> > crashes around the time new streaming jobs are deployed. We use RocksDB
> > backend so this might be related.
> >
> > We tried changing the GC from G1 to CMS, but that didn't help.
> >
> > Yesterday for instance 6 task managers crashed one after the other
> > similar errors:
> >
> > *** Error in `java': double free or corruption (!prev):
> 0x7fac0414d760
> > ***
> > *** Error in `java': free(): invalid pointer: 0x7f8dcc0026c0 ***
> > *** Error in `java': double free or corruption (!prev):
> 0x7f15247f9a90
> > ***
> > ...
> >
> > Does anyone have any clue what might cause this or how to debug?
> > This is a very critical issue :(
> >
> > Cheers,
> > Gyula
> >
>


Re: Task manager processes crashing one after the other

2016-08-25 Thread Till Rohrmann
Hi Gyula,

I haven't seen this problem before. Do you have the logs of the failed TMs
so that we have some more context what was going on?

Cheers,
Till

On Thu, Aug 25, 2016 at 9:40 AM, Gyula Fóra  wrote:

> Hi guys,
>
> For quite some time now we fairly frequently experience task manager
> crashes around the time new streaming jobs are deployed. We use RocksDB
> backend so this might be related.
>
> We tried changing the GC from G1 to CMS, but that didn't help.
>
> Yesterday for instance 6 task managers crashed one after the other
> similar errors:
>
> *** Error in `java': double free or corruption (!prev): 0x7fac0414d760
> ***
> *** Error in `java': free(): invalid pointer: 0x7f8dcc0026c0 ***
> *** Error in `java': double free or corruption (!prev): 0x7f15247f9a90
> ***
> ...
>
> Does anyone have any clue what might cause this or how to debug?
> This is a very critical issue :(
>
> Cheers,
> Gyula
>


Task manager processes crashing one after the other

2016-08-25 Thread Gyula Fóra
Hi guys,

For quite some time now we fairly frequently experience task manager
crashes around the time new streaming jobs are deployed. We use RocksDB
backend so this might be related.

We tried changing the GC from G1 to CMS, but that didn't help.

Yesterday for instance 6 task managers crashed one after the other with
similar errors:

*** Error in `java': double free or corruption (!prev): 0x7fac0414d760
***
*** Error in `java': free(): invalid pointer: 0x7f8dcc0026c0 ***
*** Error in `java': double free or corruption (!prev): 0x7f15247f9a90
***
...

Does anyone have any clue what might cause this or how to debug?
This is a very critical issue :(

Cheers,
Gyula