Re: Additional project downloads
Also the metrics reporters. The circumstances for this request are that I wanted to use the metrics reporters for 1.1.1 and had to go looking on Maven Central (also had to download dependencies, which may be an issue with packaging). I'm also looking to update the Gelly documentation to walk through running the examples and was looking to avoid pointing users to Maven Central. This sounds like a nice compromise, to include the extra jars in the download but off the default classpath. The distinction between lib and libraries may be subtle and not as applicable to the other jars. Would something like "opt" be too obtuse?

On Thu, Aug 25, 2016 at 1:55 PM, Stephan Ewen wrote:
> The downloads would be just the components' jar files, or everything?
>
> At some point, someone suggested to add the jars of all libraries (gelly,
> ml, ...) and connectors into the download tarball:
>
> - bin/
> - conf/
> - lib/ (core flink runtime and apis)
> - libraries/
>   +- gelly/
>   +- CEP/
>   +- ...
> - examples/
>
> Would that be interesting to people?
>
> On Wed, Aug 24, 2016 at 5:26 PM, Till Rohrmann wrote:
> > I agree that it would be good to offer these kind of convenience download
> > links.
> >
> > On Wed, Aug 24, 2016 at 5:25 PM, Robert Metzger wrote:
> > > Maybe we should put a link to maven central. We could parameterize the
> > > link so that it always links to the current release linked on our
> > > downloads page.
> > >
> > > On Wed, Aug 24, 2016 at 5:04 PM, Greg Hogan wrote:
> > > > Hi,
> > > >
> > > > Should Flink add-ons such as CEP, Gelly, ML, and the optional Metrics
> > > > Reporters be available from the download page? Is the alternative to
> > > > direct users to Maven Central?
> > > >
> > > > Greg
[jira] [Created] (FLINK-4502) Cassandra connector documentation has misleading consistency guarantees
Elias Levy created FLINK-4502:
Summary: Cassandra connector documentation has misleading consistency guarantees
Key: FLINK-4502
URL: https://issues.apache.org/jira/browse/FLINK-4502
Project: Flink
Issue Type: Bug
Components: Documentation
Affects Versions: 1.1.0
Reporter: Elias Levy

The Cassandra connector documentation states that "enableWriteAheadLog() is an optional method, that allows exactly-once processing for non-deterministic algorithms." This claim appears to be false.

From what I gather, the write ahead log feature of the connector works as follows:
- The sink is replaced with a stateful operator that writes incoming messages to the state backend based on the checkpoint they belong to.
- When the operator is notified that a Flink checkpoint has been completed, then for each checkpoint older than and including the committed one, it:
  * reads its messages from the state backend
  * writes them to Cassandra
  * records that it has committed them to Cassandra for the specific checkpoint and operator instance
  * and erases them from the state backend.

This process attempts to avoid resubmitting queries to Cassandra that would otherwise occur when recovering a job from a checkpoint and having messages replayed.

Alas, this does not guarantee exactly-once semantics at the sink. The writes to Cassandra that occur when the operator is notified that a checkpoint is completed are not atomic and they are potentially non-idempotent. If the job dies while writing to Cassandra or before committing the checkpoint via the committer, queries will be replayed when the job recovers. Thus the documentation appears to be incorrect in stating this provides exactly-once semantics.

There also seems to be an issue in GenericWriteAheadSink's notifyOfCompletedCheckpoint which may result in incorrect output.
If sendValues returns false because a write failed, instead of bailing, it simply moves on to the next checkpoint to commit if there is one, keeping the previous one around to try again later. But that can result in newer data being overwritten with older data when the previous checkpoint is retried. Moreover, given that CassandraCommitter implements isCheckpointCommitted as checkpointID <= this.lastCommittedCheckpointID, when the sink goes back to retry the uncommitted older checkpoint it will consider it committed, even though some of its data may not have been written out, and that data will be discarded.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
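The interaction described above can be illustrated with a JDK-only toy model (all names are hypothetical; this is not Flink's actual code). A committer that records only the highest committed checkpoint id cannot distinguish "skipped because the write failed" from "committed", so a failed older checkpoint silently becomes "done" once a newer one commits:

{code:java}
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the reported bug: isCheckpointCommitted() is implemented as a
// simple high-water-mark comparison, and the commit loop moves on past a
// failed write instead of bailing out.
public class CommitSkipDemo {

    static long lastCommittedCheckpointId = -1;

    // Mirrors the described CassandraCommitter logic.
    static boolean isCheckpointCommitted(long checkpointId) {
        return checkpointId <= lastCommittedCheckpointId;
    }

    // Simulates notifyOfCompletedCheckpoint: try each pending checkpoint in
    // order; value = whether the sendValues() write succeeds.
    static List<Long> commitPending(Map<Long, Boolean> pendingWriteSucceeds) {
        List<Long> actuallyWritten = new ArrayList<>();
        for (Map.Entry<Long, Boolean> e : pendingWriteSucceeds.entrySet()) {
            long id = e.getKey();
            if (isCheckpointCommitted(id)) {
                continue; // looks "done" -- its data is silently discarded
            }
            if (e.getValue()) { // sendValues succeeded
                actuallyWritten.add(id);
                lastCommittedCheckpointId = id;
            }
            // on failure: fall through to the next checkpoint (the bug)
        }
        return actuallyWritten;
    }

    public static void main(String[] args) {
        Map<Long, Boolean> pending = new LinkedHashMap<>();
        pending.put(1L, false); // write for checkpoint 1 fails
        pending.put(2L, true);  // write for checkpoint 2 succeeds
        commitPending(pending);
        // On retry, checkpoint 1 now appears committed although its data
        // was never written out:
        System.out.println(isCheckpointCommitted(1L)); // true
    }
}
{code}

After checkpoint 2 commits, the high-water mark makes checkpoint 1 look committed, matching the data-loss scenario described above.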
[jira] [Created] (FLINK-4501) Cassandra sink can lose messages
Elias Levy created FLINK-4501:
Summary: Cassandra sink can lose messages
Key: FLINK-4501
URL: https://issues.apache.org/jira/browse/FLINK-4501
Project: Flink
Issue Type: Bug
Components: Cassandra Connector
Affects Versions: 1.1.0
Reporter: Elias Levy

The problem is the same as I pointed out with the Kafka producer sink (FLINK-4027). The CassandraTupleSink's send() and CassandraPojoSink's send() both send data asynchronously to Cassandra and record whether an error occurs via a future callback. But CassandraSinkBase does not implement Checkpointed, so it can't stop a checkpoint from happening even though there are Cassandra queries in flight from the checkpoint that may fail. If they do fail, they would subsequently not be replayed when the job recovered, and would thus be lost.

In addition, CassandraSinkBase's close should check whether there is a pending exception and throw it, rather than silently closing. It should also wait for any pending async queries to complete and check their status before closing.
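A minimal JDK-only sketch of the pattern the report asks for (names hypothetical, no Flink or Cassandra dependencies): remember in-flight async writes and any recorded failure, and refuse to acknowledge a checkpoint or close until the writes are resolved:

{code:java}
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a sink that tracks pending async writes so that a checkpoint
// cannot complete (and close() cannot succeed) while writes are in flight
// or after one has failed.
public class AsyncSinkSketch {
    private final Set<CompletableFuture<Void>> inFlight = ConcurrentHashMap.newKeySet();
    private final AtomicReference<Throwable> pendingError = new AtomicReference<>();

    public void send(CompletableFuture<Void> writeFuture) {
        inFlight.add(writeFuture);
        writeFuture.whenComplete((ok, err) -> {
            inFlight.remove(writeFuture);
            if (err != null) {
                pendingError.compareAndSet(null, err); // keep the first failure
            }
        });
    }

    // Called before acknowledging a checkpoint, and from close():
    // wait for in-flight writes, then surface any recorded failure.
    public void flushAndCheck() throws Exception {
        CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0]))
                .handle((r, e) -> null) // failures are re-checked below
                .join();
        Throwable t = pendingError.get();
        if (t != null) {
            throw new Exception("async write failed", t);
        }
    }
}
{code}

This deliberately ignores writes issued concurrently with the flush; a real sink would need to fence new sends while flushing.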
[jira] [Created] (FLINK-4500) Cassandra sink can lose messages
Elias Levy created FLINK-4500:
Summary: Cassandra sink can lose messages
Key: FLINK-4500
URL: https://issues.apache.org/jira/browse/FLINK-4500
Project: Flink
Issue Type: Bug
Components: Cassandra Connector
Affects Versions: 1.1.0
Reporter: Elias Levy

The problem is the same as I pointed out with the Kafka producer sink (FLINK-4027). The CassandraTupleSink's send() and CassandraPojoSink's send() both send data asynchronously to Cassandra and record whether an error occurs via a future callback. But CassandraSinkBase does not implement Checkpointed, so it can't stop a checkpoint from happening even though there are Cassandra queries in flight from the checkpoint that may fail. If they do fail, they would subsequently not be replayed when the job recovered, and would thus be lost.

In addition, CassandraSinkBase's close should check whether there is a pending exception and throw it, rather than silently closing. It should also wait for any pending async queries to complete and check their status before closing.
Re: Task manager processes crashing one after the other
Stephan,

I ported the fix for the concurrency issue from the Flink commit, so now that should be fine. I ran some fail/restore tests and that specific issue hasn't appeared again.

However, I now get many segfaults in the initializeForJob method where the RocksDB instance is opened. Just for the record, this is the same exact code as we have in Flink now:

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x7f12b018f51f, pid=12576, tid=139668190197504
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 )
# Problematic frame:
# C [libc.so.6+0x7b51f]
...
Stack: [0x7f0708ccf000,0x7f0708dd], sp=0x7f0708dccd20, free space=1015k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libc.so.6+0x7b51f]
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j org.rocksdb.RocksDB.open(JLjava/lang/String;Ljava/util/List;I)Ljava/util/List;+0
j org.rocksdb.RocksDB.open(Lorg/rocksdb/DBOptions;Ljava/lang/String;Ljava/util/List;Ljava/util/List;)Lorg/rocksdb/RocksDB;+23
j com.king.rbea.backend.state.rocksdb.RocksDBStateBackend.initializeForJob...

And this happens fairly frequently when the jobs are restarting after failure.

Cheers,
Gyula

Gyula Fóra wrote (on Thu, 25 Aug 2016, 19:07):
> Yes seems like that, I remember the fix in Flink. I apparently made a
> mistake somewhere in our code :)
>
> Thanks,
> Gyula
>
> On Thu, Aug 25, 2016, 18:59 Stephan Ewen wrote:
> > We saw some crashes in earlier versions when native handles in RocksDB
> > (even for config option objects) were manually and too eagerly released.
> >
> > Maybe you have a similar issue here?
> >
> > On Thu, Aug 25, 2016 at 6:27 PM, Gyula Fóra wrote:
> > > Hi,
> > > This seems to be a sneaky concurrency issue in our custom statebackend
> > > implementation.
> > >
> > > I made some changes, will keep you posted.
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Thu, Aug 25, 2016, 10:54 Gyula Fóra wrote:
> > > > Hi,
> > > >
> > > > Sure I am sending the TM logs in priv.
> > > >
> > > > Currently what I did was to bump the Rocks version to 4.9.0, let's
> > > > see if that helps.
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > > > Till Rohrmann wrote (on Thu, 25 Aug 2016, 10:35):
> > > > > Hi Gyula,
> > > > >
> > > > > I haven't seen this problem before. Do you have the logs of the
> > > > > failed TMs so that we have some more context what was going on?
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Thu, Aug 25, 2016 at 9:40 AM, Gyula Fóra wrote:
> > > > > > Hi guys,
> > > > > >
> > > > > > For quite some time now we fairly frequently experience task
> > > > > > manager crashes around the time new streaming jobs are deployed.
> > > > > > We use the RocksDB backend so this might be related.
> > > > > >
> > > > > > We tried changing the GC from G1 to CMS; that didn't help.
> > > > > >
> > > > > > Yesterday, for instance, 6 task managers crashed one after the
> > > > > > other with similar errors:
> > > > > >
> > > > > > *** Error in `java': double free or corruption (!prev): 0x7fac0414d760 ***
> > > > > > *** Error in `java': free(): invalid pointer: 0x7f8dcc0026c0 ***
> > > > > > *** Error in `java': double free or corruption (!prev): 0x7f15247f9a90 ***
> > > > > > ...
> > > > > >
> > > > > > Does anyone have any clue what might cause this or how to debug?
> > > > > > This is a very critical issue :(
> > > > > >
> > > > > > Cheers,
> > > > > > Gyula
Fwd: Enabling Encryption between slaves in Flink
Hi,

I have a requirement that all the data flowing between the task managers should be encrypted. Is there a way in Flink to do that? Can we use the configuration file to enable this as follows: http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#Remoting_Sample or do we need to add the above configurations in the code here: https://github.com/apache/flink/blob/master/flink-runtime/src/main/scala/org/apache/flink/runtime/akka/AkkaUtils.scala

I have looked at this mail thread, but wanted to get a clear understanding of how we can achieve it: http://mail-archives.apache.org/mod_mbox/flink-user/201601.mbox/%3CCANC1h_v0pqfvfto478ft5cbgm-bf-do742gouz528bw7vrj...@mail.gmail.com%3E

Regards,
Vinay Patil
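For reference, the settings described on the Akka remoting page linked above look roughly like the fragment below when placed in an Akka application.conf. All paths, passwords, and algorithm choices are placeholders, and whether Flink forwards such settings from flink-conf.yaml to its internal ActorSystem is exactly the open question in this thread, so treat this as the Akka side of the picture only:

```hocon
akka {
  remote {
    # switch from the plain tcp transport to the ssl transport
    enabled-transports = ["akka.remote.netty.ssl"]
    netty.ssl {
      enable-ssl = true
      security {
        key-store = "/path/to/keystore.jks"          # placeholder
        key-store-password = "changeme"              # placeholder
        key-password = "changeme"                    # placeholder
        trust-store = "/path/to/truststore.jks"      # placeholder
        trust-store-password = "changeme"            # placeholder
        protocol = "TLSv1.2"
        enabled-algorithms = ["TLS_RSA_WITH_AES_128_CBC_SHA"]
      }
    }
  }
}
```

Note this covers the akka control plane; Flink's data exchange between task managers goes over its own Netty channels, which would need to be secured separately.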
Re: Additional project downloads
The downloads would be just the components' jar files, or everything?

At some point, someone suggested to add the jars of all libraries (gelly, ml, ...) and connectors into the download tarball:

- bin/
- conf/
- lib/ (core flink runtime and apis)
- libraries/
  +- gelly/
  +- CEP/
  +- ...
- examples/

Would that be interesting to people?

On Wed, Aug 24, 2016 at 5:26 PM, Till Rohrmann wrote:
> I agree that it would be good to offer these kind of convenience download
> links.
>
> On Wed, Aug 24, 2016 at 5:25 PM, Robert Metzger wrote:
> > Maybe we should put a link to maven central. We could parameterize the
> > link so that it always links to the current release linked on our
> > downloads page.
> >
> > On Wed, Aug 24, 2016 at 5:04 PM, Greg Hogan wrote:
> > > Hi,
> > >
> > > Should Flink add-ons such as CEP, Gelly, ML, and the optional Metrics
> > > Reporters be available from the download page? Is the alternative to
> > > direct users to Maven Central?
> > >
> > > Greg
[jira] [Created] (FLINK-4499) Introduce findbugs maven plugin
Ted Yu created FLINK-4499:
Summary: Introduce findbugs maven plugin
Key: FLINK-4499
URL: https://issues.apache.org/jira/browse/FLINK-4499
Project: Flink
Issue Type: Improvement
Reporter: Ted Yu

As suggested by Stephan in FLINK-4482, this issue is to add findbugs-maven-plugin into the build process so that we can detect lack of proper locking and other defects automatically.
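For context, wiring findbugs-maven-plugin into a Maven build generally looks like the sketch below. The version and configuration values are illustrative, not the settings ultimately chosen for Flink:

{code:xml}
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>findbugs-maven-plugin</artifactId>
  <version>3.0.4</version> <!-- illustrative version -->
  <configuration>
    <effort>Max</effort>          <!-- deepest analysis -->
    <threshold>Low</threshold>    <!-- report low-priority bugs too -->
    <failOnError>true</failOnError>
  </configuration>
  <executions>
    <execution>
      <goals>
        <!-- "check" fails the build when bugs are found -->
        <goal>check</goal>
      </goals>
    </execution>
  </executions>
</plugin>
{code}

Binding the check goal makes `mvn verify` fail on detected defects such as the locking issues mentioned above.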
Re: Task manager processes crashing one after the other
Yes seems like that, I remember the fix in Flink. I apparently made a mistake somewhere in our code :)

Thanks,
Gyula

On Thu, Aug 25, 2016, 18:59 Stephan Ewen wrote:
> We saw some crashes in earlier versions when native handles in RocksDB
> (even for config option objects) were manually and too eagerly released.
>
> Maybe you have a similar issue here?
>
> On Thu, Aug 25, 2016 at 6:27 PM, Gyula Fóra wrote:
> > Hi,
> > This seems to be a sneaky concurrency issue in our custom statebackend
> > implementation.
> >
> > I made some changes, will keep you posted.
> >
> > Cheers,
> > Gyula
> >
> > On Thu, Aug 25, 2016, 10:54 Gyula Fóra wrote:
> > > Hi,
> > >
> > > Sure I am sending the TM logs in priv.
> > >
> > > Currently what I did was to bump the Rocks version to 4.9.0, let's
> > > see if that helps.
> > >
> > > Cheers,
> > > Gyula
> > >
> > > Till Rohrmann wrote (on Thu, 25 Aug 2016, 10:35):
> > > > Hi Gyula,
> > > >
> > > > I haven't seen this problem before. Do you have the logs of the
> > > > failed TMs so that we have some more context what was going on?
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Thu, Aug 25, 2016 at 9:40 AM, Gyula Fóra wrote:
> > > > > Hi guys,
> > > > >
> > > > > For quite some time now we fairly frequently experience task
> > > > > manager crashes around the time new streaming jobs are deployed.
> > > > > We use the RocksDB backend so this might be related.
> > > > >
> > > > > We tried changing the GC from G1 to CMS; that didn't help.
> > > > >
> > > > > Yesterday, for instance, 6 task managers crashed one after the
> > > > > other with similar errors:
> > > > >
> > > > > *** Error in `java': double free or corruption (!prev): 0x7fac0414d760 ***
> > > > > *** Error in `java': free(): invalid pointer: 0x7f8dcc0026c0 ***
> > > > > *** Error in `java': double free or corruption (!prev): 0x7f15247f9a90 ***
> > > > > ...
> > > > >
> > > > > Does anyone have any clue what might cause this or how to debug?
> > > > > This is a very critical issue :(
> > > > >
> > > > > Cheers,
> > > > > Gyula
Re: Task manager processes crashing one after the other
We saw some crashes in earlier versions when native handles in RocksDB (even for config option objects) were manually and too eagerly released.

Maybe you have a similar issue here?

On Thu, Aug 25, 2016 at 6:27 PM, Gyula Fóra wrote:
> Hi,
> This seems to be a sneaky concurrency issue in our custom statebackend
> implementation.
>
> I made some changes, will keep you posted.
>
> Cheers,
> Gyula
>
> On Thu, Aug 25, 2016, 10:54 Gyula Fóra wrote:
> > Hi,
> >
> > Sure I am sending the TM logs in priv.
> >
> > Currently what I did was to bump the Rocks version to 4.9.0, let's see
> > if that helps.
> >
> > Cheers,
> > Gyula
> >
> > Till Rohrmann wrote (on Thu, 25 Aug 2016, 10:35):
> > > Hi Gyula,
> > >
> > > I haven't seen this problem before. Do you have the logs of the
> > > failed TMs so that we have some more context what was going on?
> > >
> > > Cheers,
> > > Till
> > >
> > > On Thu, Aug 25, 2016 at 9:40 AM, Gyula Fóra wrote:
> > > > Hi guys,
> > > >
> > > > For quite some time now we fairly frequently experience task
> > > > manager crashes around the time new streaming jobs are deployed.
> > > > We use the RocksDB backend so this might be related.
> > > >
> > > > We tried changing the GC from G1 to CMS; that didn't help.
> > > >
> > > > Yesterday, for instance, 6 task managers crashed one after the
> > > > other with similar errors:
> > > >
> > > > *** Error in `java': double free or corruption (!prev): 0x7fac0414d760 ***
> > > > *** Error in `java': free(): invalid pointer: 0x7f8dcc0026c0 ***
> > > > *** Error in `java': double free or corruption (!prev): 0x7f15247f9a90 ***
> > > > ...
> > > >
> > > > Does anyone have any clue what might cause this or how to debug?
> > > > This is a very critical issue :(
> > > >
> > > > Cheers,
> > > > Gyula
[jira] [Created] (FLINK-4498) Better Cassandra sink documentation
Elias Levy created FLINK-4498:
Summary: Better Cassandra sink documentation
Key: FLINK-4498
URL: https://issues.apache.org/jira/browse/FLINK-4498
Project: Flink
Issue Type: Improvement
Components: Documentation
Affects Versions: 1.1.0
Reporter: Elias Levy

The Cassandra sink documentation is somewhat muddled and could be improved. For instance, the fact that it only supports tuples and POJOs that use DataStax Mapper annotations is only mentioned in passing, and it is not clear that the reference to tuples only applies to Flink Java tuples and not Scala tuples. The documentation also does not mention that setQuery() is only necessary for tuple streams. It would be good to have an example of a POJO stream with the DataStax annotations.

The explanation of the write ahead log could use some cleaning up to clarify when it is appropriate to use, ideally with an example. Maybe this would be best as a blog post to expand on the type of non-deterministic streams this applies to.

It would also be useful to mention that tuple elements will be mapped to Cassandra columns using the DataStax Java driver's default encoders, which are somewhat limited (e.g. to write to a blob column the type in the tuple must be a java.nio.ByteBuffer and not just a byte[]).
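On the last point, a small JDK-only illustration of the blob-mapping caveat (the class and method names here are made up for the example): since the driver's default codecs map CQL blob to java.nio.ByteBuffer rather than byte[], raw bytes have to be wrapped before they can be placed in a tuple bound to a blob column:

{code:java}
import java.nio.ByteBuffer;

// Illustrates the caveat above: wrap raw bytes in a ByteBuffer before
// handing them to the driver, which does not accept byte[] for blob
// columns by default.
public class BlobMappingDemo {

    public static ByteBuffer toBlobValue(byte[] raw) {
        // wrap() creates a view over the array; no copy is made
        return ByteBuffer.wrap(raw);
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3};
        ByteBuffer blob = toBlobValue(payload);
        System.out.println(blob.remaining()); // 3
    }
}
{code}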
Re: Task manager processes crashing one after the other
Hi,
This seems to be a sneaky concurrency issue in our custom statebackend implementation.

I made some changes, will keep you posted.

Cheers,
Gyula

On Thu, Aug 25, 2016, 10:54 Gyula Fóra wrote:
> Hi,
>
> Sure I am sending the TM logs in priv.
>
> Currently what I did was to bump the Rocks version to 4.9.0, let's see if
> that helps.
>
> Cheers,
> Gyula
>
> Till Rohrmann wrote (on Thu, 25 Aug 2016, 10:35):
> > Hi Gyula,
> >
> > I haven't seen this problem before. Do you have the logs of the failed
> > TMs so that we have some more context what was going on?
> >
> > Cheers,
> > Till
> >
> > On Thu, Aug 25, 2016 at 9:40 AM, Gyula Fóra wrote:
> > > Hi guys,
> > >
> > > For quite some time now we fairly frequently experience task manager
> > > crashes around the time new streaming jobs are deployed. We use the
> > > RocksDB backend so this might be related.
> > >
> > > We tried changing the GC from G1 to CMS; that didn't help.
> > >
> > > Yesterday, for instance, 6 task managers crashed one after the other
> > > with similar errors:
> > >
> > > *** Error in `java': double free or corruption (!prev): 0x7fac0414d760 ***
> > > *** Error in `java': free(): invalid pointer: 0x7f8dcc0026c0 ***
> > > *** Error in `java': double free or corruption (!prev): 0x7f15247f9a90 ***
> > > ...
> > >
> > > Does anyone have any clue what might cause this or how to debug?
> > > This is a very critical issue :(
> > >
> > > Cheers,
> > > Gyula
[jira] [Created] (FLINK-4497) Add support for Scala tuples and case classes to Cassandra sink
Elias Levy created FLINK-4497:
Summary: Add support for Scala tuples and case classes to Cassandra sink
Key: FLINK-4497
URL: https://issues.apache.org/jira/browse/FLINK-4497
Project: Flink
Issue Type: Improvement
Components: Cassandra Connector
Affects Versions: 1.1.0
Reporter: Elias Levy

The new Cassandra sink only supports streams of Flink Java tuples and Java POJOs that have been annotated for use by Datastax Mapper. The sink should be extended to support Scala types and case classes.
[jira] [Created] (FLINK-4496) Refactor the TimeServiceProvider to take a Trigerable instead of a Runnable.
Kostas Kloudas created FLINK-4496:
Summary: Refactor the TimeServiceProvider to take a Trigerable instead of a Runnable.
Key: FLINK-4496
URL: https://issues.apache.org/jira/browse/FLINK-4496
Project: Flink
Issue Type: Sub-task
Reporter: Kostas Kloudas
Assignee: Kostas Kloudas
[jira] [Created] (FLINK-4495) Running multiple jobs on yarn (without yarn-session)
Niels Basjes created FLINK-4495:
Summary: Running multiple jobs on yarn (without yarn-session)
Key: FLINK-4495
URL: https://issues.apache.org/jira/browse/FLINK-4495
Project: Flink
Issue Type: Improvement
Components: YARN Client
Reporter: Niels Basjes

I created a small application that needs to run multiple (batch) jobs on Yarn and then terminate. I essentially do right now the following:

flink run -m yarn-cluster -yn 10 bla.jar ...

And in my main I do:

foreach thing I need to do {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    env. ... define the batch job.
    env.execute
}

In the second job I submit I get an exception:
{code}
java.lang.RuntimeException: Unable to tell application master to stop once the specified job has been finised
	at org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:184)
	at org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:202)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:389)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:376)
	at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:61)
	at com.bol.tools.hbase.export.Main.runSingleTable(Main.java:220)
	at com.bol.tools.hbase.export.Main.run(Main.java:81)
	at com.bol.tools.hbase.export.Main.main(Main.java:42)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:509)
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:403)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:331)
	at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:775)
	at org.apache.flink.client.CliFrontend.run(CliFrontend.java:251)
	at org.apache.flink.client.CliFrontend$2.run(CliFrontend.java:995)
	at org.apache.flink.client.CliFrontend$2.run(CliFrontend.java:992)
	at org.apache.flink.runtime.security.SecurityUtils$1.run(SecurityUtils.java:56)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.flink.runtime.security.SecurityUtils.runSecured(SecurityUtils.java:53)
	at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:992)
	at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1046)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds]
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
	at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
	at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
	at scala.concurrent.Await$.result(package.scala:107)
	at scala.concurrent.Await.result(package.scala)
	at org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:182)
	... 25 more
{code}
[jira] [Created] (FLINK-4494) Expose the TimeServiceProvider from the Task to each Operator.
Kostas Kloudas created FLINK-4494:
Summary: Expose the TimeServiceProvider from the Task to each Operator.
Key: FLINK-4494
URL: https://issues.apache.org/jira/browse/FLINK-4494
Project: Flink
Issue Type: Bug
Reporter: Kostas Kloudas

This change aims at simplifying the {{StreamTask}} class by directly exposing the {{TimeServiceProvider}} to the operators being executed. This implies removing the {{registerTimer()}} and {{getCurrentProcessingTime()}} methods from the {{StreamTask}}. Now, to register a timer and query the time, each operator will be able to get the {{TimeServiceProvider}} and call the corresponding methods directly on it. In addition, this will simplify many of the tests which now implement their own time providers.
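A JDK-only sketch of the shape operators would see after such a change (the interface is inferred from the description above, not taken from Flink's actual code): a single object that both answers "what time is it?" and schedules processing-time timers:

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the provider described above. Operators would
// receive this object directly instead of going through the StreamTask.
public class SimpleTimeServiceProvider {

    private final ScheduledExecutorService timerService =
            Executors.newSingleThreadScheduledExecutor();

    public long getCurrentProcessingTime() {
        return System.currentTimeMillis();
    }

    // Schedule 'target' to run at the given absolute processing-time
    // timestamp; already-elapsed timestamps fire immediately.
    public ScheduledFuture<?> registerTimer(long timestamp, Runnable target) {
        long delay = Math.max(0, timestamp - getCurrentProcessingTime());
        return timerService.schedule(target, delay, TimeUnit.MILLISECONDS);
    }

    public void shutdown() {
        timerService.shutdownNow();
    }
}
{code}

Tests could then swap in a provider with a manually advanced clock, which is the simplification of test code the issue mentions.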
[jira] [Created] (FLINK-4493) Unify the snapshot output format for keyed-state backends
Stefan Richter created FLINK-4493:
Summary: Unify the snapshot output format for keyed-state backends
Key: FLINK-4493
URL: https://issues.apache.org/jira/browse/FLINK-4493
Project: Flink
Issue Type: Improvement
Reporter: Stefan Richter
Priority: Minor

We could unify the output format for keyed-state backend implementations, e.g. those based on RocksDB and Heap, to write a single, common output format. For example, this would allow us to restore state that was previously kept in RocksDB on a heap-located backend and vice versa.
[jira] [Created] (FLINK-4490) Decouple Slot and Instance
Kurt Young created FLINK-4490:
Summary: Decouple Slot and Instance
Key: FLINK-4490
URL: https://issues.apache.org/jira/browse/FLINK-4490
Project: Flink
Issue Type: Sub-task
Reporter: Kurt Young

Currently, {{Slot}} and {{Instance}} hold references to each other. For {{Instance}} holding {{Slot}}, it makes sense because it reflects how many resources the instance can provide and how many are in use. But it's not really necessary for {{Slot}} to hold the {{Instance}} it belongs to; it only needs to hold some connection information and a gateway to talk to. Another downside of {{Slot}} holding {{Instance}} is that {{Instance}} actually contains some allocation/de-allocation logic, which makes it difficult to refactor the allocation code without {{Slot}} noticing. We should abstract the connection information out of {{Instance}} and let {{Slot}} hold that instead. (Actually we have {{InstanceConnectionInfo}} now, but it lacks the instance's akka gateway; maybe we can just add the akka gateway to the {{InstanceConnectionInfo}}.)
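The proposed decoupling can be sketched in a few lines of JDK-only Java (all type names here are hypothetical stand-ins, not Flink's actual classes): the slot keeps only an address plus a gateway, with no back-reference to the whole instance:

{code:java}
// Sketch of the decoupling described above: Slot depends on a small
// connection-info object (host, port, gateway) instead of Instance.
public class SlotSketch {

    interface TaskManagerGateway {        // stand-in for the akka gateway
        void submitTask(String taskDescription);
    }

    static final class ConnectionInfo {   // stand-in for InstanceConnectionInfo
        final String host;
        final int dataPort;
        final TaskManagerGateway gateway; // gateway added alongside host/port
        ConnectionInfo(String host, int dataPort, TaskManagerGateway gateway) {
            this.host = host;
            this.dataPort = dataPort;
            this.gateway = gateway;
        }
    }

    static final class Slot {
        private final ConnectionInfo connection; // no Instance back-reference
        Slot(ConnectionInfo connection) { this.connection = connection; }
        void deploy(String task) { connection.gateway.submitTask(task); }
        String location() { return connection.host + ":" + connection.dataPort; }
    }

    public static Slot newSlot(String host, int port, TaskManagerGateway gw) {
        return new Slot(new ConnectionInfo(host, port, gw));
    }
}
{code}

With this shape, allocation logic can be reworked inside the instance-management code without touching {{Slot}}.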
[jira] [Created] (FLINK-4491) Handle index.number_of_shards in the ES connector
Flavio Pompermaier created FLINK-4491:
Summary: Handle index.number_of_shards in the ES connector
Key: FLINK-4491
URL: https://issues.apache.org/jira/browse/FLINK-4491
Project: Flink
Issue Type: Improvement
Components: Streaming Connectors
Affects Versions: 1.1.0
Reporter: Flavio Pompermaier
Priority: Minor

At the moment it is not possible to configure the number of shards if an index does not already exist on the Elasticsearch cluster. It would be a great improvement to handle index.number_of_shards (passed in the configuration object). E.g.:
{code:java}
Map<String, String> config = Maps.newHashMap();
config.put("bulk.flush.max.actions", "1");
config.put("cluster.name", "my-cluster-name");
config.put("index.number_of_shards", "1");
{code}
[jira] [Created] (FLINK-4492) Cleanup files from canceled snapshots
Stefan Richter created FLINK-4492:
Summary: Cleanup files from canceled snapshots
Key: FLINK-4492
URL: https://issues.apache.org/jira/browse/FLINK-4492
Project: Flink
Issue Type: Bug
Reporter: Stefan Richter
Priority: Minor

Current checkpointing only closes CheckpointStateOutputStreams on cancel, but incomplete files are not properly deleted from the filesystem.
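A JDK-only sketch of the fix the issue asks for (the class and method names are hypothetical, not Flink's actual API): an output stream that deletes its partial file when it is closed before being finalized, rather than merely closing the handle:

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: close() without a prior closeAndFinalize() is treated as a
// cancellation and removes the partially written snapshot file.
public class CleanupOnCancelStream extends OutputStream {

    private final Path target;
    private final OutputStream out;
    private boolean finalized;

    public CleanupOnCancelStream(Path target) throws IOException {
        this.target = target;
        this.out = Files.newOutputStream(target);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
    }

    // Called on successful snapshot completion; keeps the file.
    public void closeAndFinalize() throws IOException {
        finalized = true;
        out.close();
    }

    // Plain close() == cancellation: close the handle AND remove the file,
    // so no incomplete snapshot files are left behind.
    @Override
    public void close() throws IOException {
        out.close();
        if (!finalized) {
            Files.deleteIfExists(target);
        }
    }
}
{code}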
[jira] [Created] (FLINK-4489) Implement TaskManager's SlotManager
Till Rohrmann created FLINK-4489:
Summary: Implement TaskManager's SlotManager
Key: FLINK-4489
URL: https://issues.apache.org/jira/browse/FLINK-4489
Project: Flink
Issue Type: Sub-task
Reporter: Till Rohrmann

The {{SlotManager}} is responsible for managing the available slots on the {{TaskManager}}. This basically means maintaining the mapping between slots and the owning {{JobManagers}}, and offering tasks which run in the slots access to the owning {{JobManagers}}.
[jira] [Created] (FLINK-4488) Prevent cluster shutdown after job execution for non-detached jobs
Maximilian Michels created FLINK-4488:
Summary: Prevent cluster shutdown after job execution for non-detached jobs
Key: FLINK-4488
URL: https://issues.apache.org/jira/browse/FLINK-4488
Project: Flink
Issue Type: Bug
Components: YARN Client
Affects Versions: 1.2.0, 1.1.1
Reporter: Maximilian Michels
Assignee: Maximilian Michels
Priority: Minor
Fix For: 1.2.0, 1.1.2

In per-job mode, the Yarn cluster currently shuts down after the first interactively executed job. Users may want to execute multiple jobs in one Jar. I would suggest to use this mechanism only for jobs which run detached. For interactive jobs, shutdown of the cluster is additionally handled by the CLI, which should be sufficient to ensure cluster shutdown. Cluster shutdown could only become a problem in case of a network partition to the cluster or outage of the CLI.
[jira] [Created] (FLINK-4487) Need tools for managing the yarn-session better
Niels Basjes created FLINK-4487:
Summary: Need tools for managing the yarn-session better
Key: FLINK-4487
URL: https://issues.apache.org/jira/browse/FLINK-4487
Project: Flink
Issue Type: Improvement
Components: YARN Client
Affects Versions: 1.1.0
Reporter: Niels Basjes

There is a script yarn-session.sh which starts an empty JobManager on a Yarn cluster. Desired improvements:
# If there is already a yarn-session running, then yarn-session does not start a new one (or it kills the old one?). Note that the file with ip/port may exist yet the corresponding JobManager may have been killed in another way.
# A script that effectively lets me stop a yarn session and clean up the file that contains the ip/port of this yarn session and the .flink directory on HDFS.
[jira] [Created] (FLINK-4486) JobManager not fully running when yarn-session.sh finishes
Niels Basjes created FLINK-4486:
---
Summary: JobManager not fully running when yarn-session.sh finishes
Key: FLINK-4486
URL: https://issues.apache.org/jira/browse/FLINK-4486
Project: Flink
Issue Type: Bug
Components: YARN Client
Affects Versions: 1.1.0
Reporter: Niels Basjes

I start a detached yarn-session.sh. If the Yarn cluster is very busy then the yarn-session.sh script completes BEFORE all the task slots have been allocated. As a consequence I sometimes have a jobmanager without any task slots. Over time these task slots are assigned by the Yarn cluster but these are not available for the first job that is submitted. As a consequence I have found that the first few tasks in my job fail with this error "Not enough free slots available to run the job.".

I think the desirable behavior is that yarn-session waits until the jobmanager is fully functional and capable of actually running the jobs.

{code}
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (Read prefix '4') -> Map (Map prefix '4') (8/10)) @ (unassigned) - [SCHEDULED] > with groupID < cd6c37df290564e603da908a8783a9bf > in sharing group < SlotSharingGroup [c0b6eff6ce93967182cdb6dfeae9359b, 8b2c3b39f3a55adf9f123243ab03c9c1, 55fb94dd8a3e5f59a10dbbf5c4925db4, 433b2e4a05a5e685b48c517249755a89, 8c74690c35454064e4815ac3756cdca2, 4b4fbd24f3483030fd852b38ff2249c1, 5e36a56ea4dece18fe5ba04352d90dc8, cd6c37df290564e603da908a8783a9bf, 64eafa845087bee70735f7250df9994f, 706a5d6fe48ae57724a00a9fce5dae8a, 7bee4297e0e839e53a153dfcbcca8624, 21b58f7d408d237540ae7b4734f81a1d, b429b1ff338d9d73677f42717cfc0dbc, cc7491db641f557c6aa8c749ebc2de62, f61cbf0ae00331f67aaf60ace78b05aa, 606f02ea9e0f4ad57f0cc0232dd70842] >.
Resources available to scheduler: Number of instances=1, total number of slots=7, available slots=0
	at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
	at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
	at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:306)
	at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:454)
	at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:326)
	at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:734)
	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:1332)
	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1291)
	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1291)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
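The proposed behavior can be sketched as a wait loop: after submitting the session, poll the JobManager until it reports the requested number of free task slots before returning control to the user. This is a minimal sketch only; `wait_for_slots` and the `get_available_slots` callback are hypothetical illustrations, not part of the actual Flink YARN client.

```python
import time


def wait_for_slots(get_available_slots, required, timeout_s=300, poll_s=2.0):
    """Block until the session reports at least `required` free task slots.

    `get_available_slots` is a caller-supplied function that queries the
    JobManager (e.g. via its status endpoint) and returns the current
    number of free slots. Raises TimeoutError if the cluster never
    allocates enough containers within `timeout_s` seconds.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_available_slots() >= required:
            return True
        time.sleep(poll_s)
    raise TimeoutError("JobManager never reached %d free slots" % required)
```

With such a guard, yarn-session.sh would only report success once the first submitted job can actually be scheduled, instead of racing the Yarn container allocation.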
[jira] [Created] (FLINK-4485) Finished jobs in yarn session fill /tmp filesystem
Niels Basjes created FLINK-4485:
---
Summary: Finished jobs in yarn session fill /tmp filesystem
Key: FLINK-4485
URL: https://issues.apache.org/jira/browse/FLINK-4485
Project: Flink
Issue Type: Bug
Components: JobManager
Affects Versions: 1.1.0
Reporter: Niels Basjes
Priority: Blocker

On a Yarn cluster I start a yarn-session with a few containers and task slots. Then I fire a 'large' number of Flink batch jobs in sequence against this yarn session. It is the exact same job (java code), yet it gets different parameters. In this scenario it is exporting HBase tables to files in HDFS, and the parameters are about which data from which tables and the name of the target directory.

After running several dozen jobs the job submissions started to fail and we investigated. We found that the cause was that on the Yarn node which was hosting the jobmanager the /tmp file system was full (4GB was 100% full). However, the output of {{du -hcs /tmp}} showed only 200MB in use.

We found that a very large file (we guess it is the jar of the job) was put in /tmp, used, and deleted, yet the file handle was not closed by the jobmanager. As soon as we killed the jobmanager the disk space was freed. See parts of the output we got from {{lsof}} below.
{code}
COMMAND PID   USER    FD   TYPE DEVICE SIZE     NODE NAME
java    15034 nbasjes 550r REG  253,17 66219695 245  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0003 (deleted)
java    15034 nbasjes 551r REG  253,17 66219695 252  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0007 (deleted)
java    15034 nbasjes 552r REG  253,17 66219695 267  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0012 (deleted)
java    15034 nbasjes 553r REG  253,17 66219695 250  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0005 (deleted)
java    15034 nbasjes 554r REG  253,17 66219695 288  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0018 (deleted)
java    15034 nbasjes 555r REG  253,17 66219695 298  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0025 (deleted)
java    15034 nbasjes 557r REG  253,17 66219695 254  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0008 (deleted)
java    15034 nbasjes 558r REG  253,17 66219695 292  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0019 (deleted)
java    15034 nbasjes 559r REG  253,17 66219695 275  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0013 (deleted)
java    15034 nbasjes 560r REG  253,17 66219695 159  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0002 (deleted)
java    15034 nbasjes 562r REG  253,17 66219695 238  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0001 (deleted)
java    15034 nbasjes 568r REG  253,17 66219695 246  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0004 (deleted)
java    15034 nbasjes 569r REG  253,17 66219695 255  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0009 (deleted)
java    15034 nbasjes 571r REG  253,17 66219695 299  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0026 (deleted)
java    15034 nbasjes 572r REG  253,17 66219695 293  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0020 (deleted)
java    15034 nbasjes 574r REG  253,17 66219695 256  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0010 (deleted)
java    15034 nbasjes 575r REG  253,17 66219695 302  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0029 (deleted)
java    15034 nbasjes 576r REG  253,17 66219695 294  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0021 (deleted)
java    15034 nbasjes 577r REG  253,17 66219695 262  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0011 (deleted)
java    15034 nbasjes 578r REG  253,17 66219695 251  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-0006 (deleted)
java    15034 nbasjes 580r REG  253,17 66219695 295
{code}
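The "(deleted)" entries above are the classic POSIX unlinked-but-open pattern: removing a file's directory entry does not free its blocks while any process still holds an open descriptor, which is why {{du}} undercounts while the filesystem is full. The following self-contained sketch reproduces the effect on any POSIX system; it stands in for the blob-store jar handling only illustratively.

```python
import os
import tempfile

# Create a file and keep the descriptor open (stand-in for the job jar).
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 1024)

# Delete it: the directory entry is gone, du no longer counts it...
os.unlink(path)
assert not os.path.exists(path)

# ...but the data and its disk blocks survive as long as fd stays open,
# exactly the "(deleted)" state that lsof reports.
os.lseek(fd, 0, os.SEEK_SET)
assert len(os.read(fd, 2048)) == 1024

# Only closing the descriptor lets the kernel reclaim the space.
os.close(fd)
```

This matches the observed behavior: killing the jobmanager closed all its descriptors at once and the 4GB was freed immediately, so the fix is for the blob store to close each temp-file stream once the jar has been consumed.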
[jira] [Created] (FLINK-4484) FLIP-10: Unify Savepoints and Checkpoints
Ufuk Celebi created FLINK-4484:
--
Summary: FLIP-10: Unify Savepoints and Checkpoints
Key: FLINK-4484
URL: https://issues.apache.org/jira/browse/FLINK-4484
Project: Flink
Issue Type: Improvement
Reporter: Ufuk Celebi

Super issue to track progress for [FLIP-10|https://cwiki.apache.org/confluence/display/FLINK/FLIP-10%3A+Unify+Checkpoints+and+Savepoints].
Re: [DISCUSS] Python API for Flink libraries
Hello Ivan,

I don't know why it is the way it is. Regarding issues to work on: you should be able to go through the transformations documentation and see which are not supported.

Regards,
Chesnay

On 21.08.2016 01:11, Ivan Mushketyk wrote:

Hi Chesnay,

Thank you for your reply. Out of curiosity, do you know why the Python API reception was *tumbleweed*?

Regarding the Python API, do you know what specifically should be done there? I have some Python background so I was considering contributing, but I didn't find many tasks in the "Python" component: https://issues.apache.org/jira/browse/FLINK-1926?jql=project%20%3D%20FLINK%20AND%20component%20%3D%20%22Python%20API%22%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

Best regards,
Ivan.

On Fri, 19 Aug 2016 at 22:45 Chesnay Schepler wrote:

Hello,

I would say no, as the general reception of the Python API was *tumbleweed* so far. In my opinion this would just lead to a massive increase in code to maintain; we would need at least 2-3 active long-term Python contributors. Especially so since ML, CEP and Table are afaik still in heavy development.

If anything, before thinking about porting the libraries to Python it would make more sense to implement a Python streaming API. Or maybe /finish/ porting the DataSet API...

Regards,
Chesnay

On 19.08.2016 22:07, Ivan Mushketyk wrote:

Hi Flink developers,

It seems to me that Flink has two important "selling points":
1. It has Java, Scala and Python APIs
2. It has a number of useful libraries (ML, Gelly, CEP, and Table)

But as far as I understand, currently users cannot use any of these libraries through a Python API. It seems to be a gap worth filling. What do you think about it? Does it make sense to add CEP/Gelly/ML/Table Python APIs?

Best regards,
Ivan.
Re: Task manager processes crashing one after the other
Hi,

Sure, I am sending the TM logs in private. Currently what I did was to bump the RocksDB version to 4.9.0; let's see if that helps.

Cheers,
Gyula

Till Rohrmann wrote (on Thu, Aug 25, 2016, 10:35):

> Hi Gyula,
>
> I haven't seen this problem before. Do you have the logs of the failed TMs
> so that we have some more context about what was going on?
>
> Cheers,
> Till
>
> On Thu, Aug 25, 2016 at 9:40 AM, Gyula Fóra wrote:
>
> > Hi guys,
> >
> > For quite some time now we fairly frequently experience task manager
> > crashes around the time new streaming jobs are deployed. We use the RocksDB
> > backend so this might be related.
> >
> > We tried changing the GC from G1 to CMS, but that didn't help.
> >
> > Yesterday, for instance, 6 task managers crashed one after the other with
> > similar errors:
> >
> > *** Error in `java': double free or corruption (!prev): 0x7fac0414d760 ***
> > *** Error in `java': free(): invalid pointer: 0x7f8dcc0026c0 ***
> > *** Error in `java': double free or corruption (!prev): 0x7f15247f9a90 ***
> > ...
> >
> > Does anyone have any clue what might cause this or how to debug it?
> > This is a very critical issue :(
> >
> > Cheers,
> > Gyula
Re: Task manager processes crashing one after the other
Hi Gyula,

I haven't seen this problem before. Do you have the logs of the failed TMs so that we have some more context about what was going on?

Cheers,
Till

On Thu, Aug 25, 2016 at 9:40 AM, Gyula Fóra wrote:

> Hi guys,
>
> For quite some time now we fairly frequently experience task manager
> crashes around the time new streaming jobs are deployed. We use the RocksDB
> backend so this might be related.
>
> We tried changing the GC from G1 to CMS, but that didn't help.
>
> Yesterday, for instance, 6 task managers crashed one after the other with
> similar errors:
>
> *** Error in `java': double free or corruption (!prev): 0x7fac0414d760 ***
> *** Error in `java': free(): invalid pointer: 0x7f8dcc0026c0 ***
> *** Error in `java': double free or corruption (!prev): 0x7f15247f9a90 ***
> ...
>
> Does anyone have any clue what might cause this or how to debug it?
> This is a very critical issue :(
>
> Cheers,
> Gyula
Task manager processes crashing one after the other
Hi guys,

For quite some time now we fairly frequently experience task manager crashes around the time new streaming jobs are deployed. We use the RocksDB backend so this might be related.

We tried changing the GC from G1 to CMS, but that didn't help.

Yesterday, for instance, 6 task managers crashed one after the other with similar errors:

*** Error in `java': double free or corruption (!prev): 0x7fac0414d760 ***
*** Error in `java': free(): invalid pointer: 0x7f8dcc0026c0 ***
*** Error in `java': double free or corruption (!prev): 0x7f15247f9a90 ***
...

Does anyone have any clue what might cause this or how to debug it? This is a very critical issue :(

Cheers,
Gyula
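Glibc aborts like "double free or corruption" typically mean a native resource was released twice, e.g. a JNI-backed object (such as a RocksDB handle) disposed both by the task thread and again during job cancellation or redeployment. A common defensive pattern is to make disposal idempotent. This is an illustrative sketch only; `NativeHandle` and its free callback are hypothetical, not Flink or RocksDB API.

```python
import threading


class NativeHandle:
    """Wrap a native resource so its free function runs at most once."""

    def __init__(self, free_fn):
        self._free_fn = free_fn        # e.g. a JNI/native deallocation call
        self._lock = threading.Lock()
        self._disposed = False

    def dispose(self):
        # Guard against the double-free scenario: even if dispose() races
        # between the task thread and a cancellation path, only the first
        # caller reaches the underlying free function.
        with self._lock:
            if self._disposed:
                return
            self._disposed = True
        self._free_fn()
```

If the crashes correlate with job deployment/cancellation, checking that every native handle in the backend goes through such a close-once path (rather than being freed from two shutdown paths) would be one way to narrow the bug down; upgrading the native library, as tried above, is the other obvious variable.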