Re: [DISCUSS] Proposal to support disk spilling in HeapKeyedStateBackend

2019-05-29 Thread Stefan Richter
Hi Yu, Sorry for the late reaction. As already discussed internally, I think this is a very good proposal and design that can help to improve a major limitation of the current state backend. I think that most discussion is happening in the design doc and I left my comments there. Looking

Re: RocksDB native checkpoint time

2019-05-03 Thread Stefan Richter
Hi, out of curiosity, does it happen with jobs that have a large number of states (column groups) or also for jobs with few column groups and just “big state”? Best, Stefan > On 3. May 2019, at 11:04, Piotr Nowojski wrote: > > Hi Gyula, > > Have you read our tuning guide? >

Re: kafka partitions, data locality

2019-04-26 Thread Stefan Richter
Hi Sergey, The reason why this is flagged as beta is actually less about stability and more about the fact that this is supposed to be a "power user" feature, because bad things can happen if your data is not 100% correctly partitioned in the same way as Flink would partition it. This is

Re: Fast restart of a job with a large state

2019-04-18 Thread Stefan Richter
Hi, If rescaling is the problem, let me clarify that you can currently rescale from savepoints and all types of checkpoints (including incremental). If that was the only problem, then there is nothing to worry about - the documentation is only a bit conservative about this because we will not

Re: Query on job restoration using relocated savepoint

2019-04-11 Thread Stefan Richter
Hi, the first case sounds like you made a mistake when editing the paths manually and deleted one or more bytes that were not part of the path and thus corrupted the metadata. For the second approach, of course you also need to replace the paths after reading and before rewriting the

Re: Query on job restoration using relocated savepoint

2019-04-11 Thread Stefan Richter
Small correction, on the first case: more likely is that you changed the path string but I think those are prefixed by the string length, so that would require manual adjustment as well to not corrupt the metadata. > On 11. Apr 2019, at 14:42, Stefan Richter wrote: > > Hi, > >
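
The length-prefix point can be illustrated with a minimal sketch. It assumes a 2-byte big-endian length prefix followed by the string bytes (the layout used by Java's `DataOutputStream.writeUTF`, ignoring its modified UTF-8 details); whether the savepoint metadata uses exactly this layout is not confirmed by the thread, and the paths and helper names are hypothetical.

```python
import struct

def write_utf(text: str) -> bytes:
    """Encode text as a 2-byte big-endian length followed by UTF-8 bytes."""
    payload = text.encode("utf-8")
    return struct.pack(">H", len(payload)) + payload

def read_utf(buf: bytes, offset: int = 0):
    """Decode one length-prefixed string; return (text, next_offset)."""
    (length,) = struct.unpack_from(">H", buf, offset)
    start = offset + 2
    return buf[start:start + length].decode("utf-8"), start + length

# Hypothetical paths, for illustration only.
old_path = "hdfs:///old/savepoint-1"
new_path = "hdfs:///relocated/savepoint-1"

# A record: the prefixed path followed by one unrelated byte.
record = write_utf(old_path) + b"\x2a"

# Naive edit: swap the path bytes but leave the old length prefix in place.
naive = record.replace(old_path.encode("utf-8"), new_path.encode("utf-8"))

# Correct edit: rewrite the prefix together with the path.
fixed = write_utf(new_path) + record[2 + len(old_path):]

naive_text, naive_end = read_utf(naive)
fixed_text, fixed_end = read_utf(fixed)
```

With the stale prefix, the reader stops after the old length, returns a truncated path, and interprets the leftover path bytes as whatever field comes next.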

Re: [ANNOUNCE] Apache Flink 1.8.0 released

2019-04-10 Thread Stefan Richter
Congrats and thanks to Aljoscha for managing the release! Best, Stefan > On 10. Apr 2019, at 13:01, Biao Liu wrote: > > Great news! Thanks Aljoscha and all the contributors. > > Till Rohrmann <trohrm...@apache.org> wrote on Wed, Apr 10, 2019 at 6:11 PM: > Thanks a lot to Aljoscha for being our

Re: Migrating Existing TTL State to 1.8

2019-03-13 Thread Stefan Richter
Hi, If you are worried about old state, you can combine the compaction-filter-based TTL with other cleanup strategies (see docs). For example, if you set `cleanupFullSnapshot`, a savepoint will be cleared of any expired state when you take it, and you can then use that savepoint to move to Flink 1.8.

Re: Partitions and the number of cores/executors

2019-03-13 Thread Stefan Richter
Hi, Your assumption is right. Parallel processing is based on splitting inputs and each split is only processed by one task instance at a time. Best, Stefan > On 13. Mar 2019, at 09:52, mbilalce@gmail.com wrote: > > Hi, > > I am working with Gelly graph library but I think the question
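
A minimal sketch of the idea that every split is handled by exactly one task instance. The round-robin policy and the function name are assumptions made for illustration; the thread does not say how Flink actually hands out splits.

```python
def assign_splits(splits, parallelism):
    """Round-robin assignment: every split goes to exactly one task instance."""
    assignment = {task: [] for task in range(parallelism)}
    for index, split in enumerate(splits):
        assignment[index % parallelism].append(split)
    return assignment

splits = [f"split-{i}" for i in range(5)]
assignment = assign_splits(splits, parallelism=2)
```

Adding more task instances than there are splits would leave some instances idle, since a split is never shared between instances.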

Re: Will state TTL support event time cleanup in 1.8?

2019-03-13 Thread Stefan Richter
TTL based on event time is not part of 1.8, but is likely to be part of 1.9. > On 13. Mar 2019, at 13:17, Sergei Poganshev wrote: > > Do improvements introduced in > https://issues.apache.org/jira/browse/FLINK-10471 > add support for event >

Re: Understanding timestamp and watermark assignment errors

2019-03-13 Thread Stefan Richter
Hi, I think this looks like the same problem as in this issue: https://issues.apache.org/jira/browse/FLINK-11420 Best, Stefan > On 13. Mar 2019, at 09:41, Konstantin Knauf wrote: > > Hi Andrew, > > generally, this looks like a

Re: [ANNOUNCE] New Flink PMC member Thomas Weise

2019-02-12 Thread Stefan Richter
Congrats Thomas! Best, Stefan > On 12.02.2019, at 11:20, Stephen Connolly wrote: > > Congratulations to Thomas. I see that this is not his first time in the PMC > rodeo... also somebody needs to update LDAP as he's not on > https://people.apache.org/phonebook.html?pmc=flink >

Re: H-A Deployment : Job / task manager configuration

2019-02-06 Thread Stefan Richter
Hi, You only need to do the configuration in conf/flink-conf.yaml on the job manager. The configuration will be shipped to the TMs. Best, Stefan > On 5. Feb 2019, at 16:59, bastien dine wrote: > > Hello everyone, > > I would like to know what exactly I need to configure on my job / task >

Re: JDBCAppendTableSink on Data stream

2019-02-06 Thread Stefan Richter
Hi, That should be no problem, for example the `JDBCAppendTableSinkTest` is using it also with data stream. Best, Stefan > On 6. Feb 2019, at 07:29, Chirag Dewan wrote: > > Hi, > > In the documentation, the JDBC sink is mentioned as a source on Table > API/stream. > > Can I use the same

Re: improper use of releaseJob() without a matching number of createTmpFiles() calls for jobId

2019-01-22 Thread Stefan Richter
Hi, Which version of Flink are you using? This issue https://issues.apache.org/jira/browse/FLINK-10283 shows that a similar problem was fixed in 1.6.1 and 1.7. If you use a newer version and still encounter the problem, you can reopen the

Re: [SURVEY] Custom RocksDB branch

2019-01-22 Thread Stefan Richter
+1 from me as well. > On 22. Jan 2019, at 10:47, aitozi wrote: > > +1 from my side, since we rely on this feature to implement the real state > ttl . > > > > -- > Sent from: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Subtask much slower than the others when creating checkpoints

2019-01-15 Thread Stefan Richter
Hi, I have seen a few cases where for certain jobs a small imbalance in the state partition assignment did cascade into a larger imbalance of the job. If your max parallelism mod parallelism is not 0, it means that some tasks have one partition more than others. Again, depending on how much
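
The imbalance can be made concrete with a small calculation. This is only a sketch of the counting argument: which subtasks receive the extra partition is an assumption here, and Flink's actual range assignment may order them differently.

```python
def partitions_per_task(max_parallelism: int, parallelism: int):
    """Number of state partitions (key groups) each task ends up with."""
    base, extra = divmod(max_parallelism, parallelism)
    # 'extra' tasks carry one partition more than the rest.
    return [base + 1 if task < extra else base for task in range(parallelism)]

uneven = partitions_per_task(max_parallelism=128, parallelism=6)
even = partitions_per_task(max_parallelism=128, parallelism=8)
```

Choosing a parallelism that divides the max parallelism evenly removes this particular source of skew.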

Re: Bug in RocksDB timer service

2019-01-15 Thread Stefan Richter
Hi, I have never seen this before. I would assume to see this exception because the write batch is flushed and contained a write against a column family that does not exist (anymore). However, we initialize everything relevant in RocksDBCachingPriorityQueueSet as final (CF handle) and never

Re: Flink Streaming Job Task is getting cancelled silently and causing job to restart

2019-01-09 Thread Stefan Richter
Hi, Could you also provide the job master log? Best, Stefan > On 9. Jan 2019, at 12:02, sohimankotia wrote: > > Hi, > > I am running Flink Streaming Job with 1.5.5 version. > > - Job is basically reading from Kafka , windowing on 2 minutes , and writing > to hdfs using AvroBucketing Sink .

Re: windowAll and AggregateFunction

2019-01-09 Thread Stefan Richter
Hi, I think your expectation about windowAll is wrong, from the method documentation: “Note: This operation is inherently non-parallel since all elements have to pass through the same operator instance” and I also cannot think of a way in which the windowing API would support your use case

Re: Can checkpoints be used to migrate jobs between Flink versions ?

2019-01-09 Thread Stefan Richter
Hi, I would assume that this should currently work because the format of basic savepoints and checkpoints is the same right now. The restriction in the doc is probably there in case that the checkpoint format will diverge more in the future. Best, Stefan > On 9. Jan 2019, at 13:12, Edward

Re: Zookeeper shared by Flink and Kafka

2019-01-09 Thread Stefan Richter
Hi, That is more a ZK question than a Flink question, but I don’t think there is a problem. Best, Stefan > On 9. Jan 2019, at 13:31, min@ubs.com wrote: > > Hi, > > I am new to Flink. > > I have a question: > Can a zookeeper cluster be shared by a flink cluster and a kafka cluster? >

Re: Error while reading from hadoop sequence file

2018-12-13 Thread Stefan Richter
apache.flink.client.program.ClusterClient.getOptimizedPlan(ClusterClient.java:906) > at > org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:473) > at > org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62) > >

Re: CodeCache is full - Issues with job deployments

2018-12-13 Thread Stefan Richter
Hi, Thanks for analyzing the problem. If it turns out that there is a problem with the termination of the Kafka sources, could you please open an issue for that with your results? Best, Stefan > On 11. Dec 2018, at 19:04, PedroMrChaves wrote: > > Hello Stefan, > > Thank you for the reply.

Re: CodeCache is full - Issues with job deployments

2018-12-11 Thread Stefan Richter
Hi, in general, Flink uses user-code class loader for job specific code and the lifecycle of the class loader should end with the job. This usually means that job related code could be removed after the job is finished. However, objects of a class that was loaded by the user-code class loader

Re: Error while reading from hadoop sequence file

2018-12-11 Thread Stefan Richter
Hi, I am a bit confused by the explanation, the exception that you mentioned, is it happening in the first code snippet ( with the TypeInformation.of(…)) or the second one? From looking into the code, I would expect the exception can only happen in the second snippet (without TypeInformation)

Re: Very slow checkpoints occasionally occur

2018-12-11 Thread Stefan Richter
Hi, Looking at the numbers, it seems to me that checkpoint execution (the times of the sync and async parts) is always reasonably fast once it is executed on the task, but there are changes in the alignment time and the time from triggering a checkpoint to executing a checkpoint. As you are

Re: After job cancel, leftover ZK state prevents job manager startup

2018-12-11 Thread Stefan Richter
Hi, Thanks for reporting the problem, I think the exception trace looks indeed very similar to traces in the discussion for FLINK-10184. I will pull in Till who worked on the fix to hear his opinion. Maybe the current fix only made the problem less likely to appear but is not complete, yet?

Re: Failed to resume job from checkpoint

2018-12-10 Thread Stefan Richter
observe > the created files and what happens with the one that is mentioned in the > exception or introduce debug logging in the code, in particular a check if > the listed file (the link target) does exist before linking, which it should > in my opinion because it is listed in the directory.

Re: Failed to resume job from checkpoint

2018-12-07 Thread Stefan Richter
recovered checkpoint is also 1.7.0 . > > Stefan Richter <s.rich...@data-artisans.com> wrote on Fri, Dec 7, 2018 at 11:06 PM: > Just to clarify, the checkpoint from which you want to resume in 1.7, was > that taken by 1.6 or by 1.7? So far this is a bit mysterious because it says > File

Re: Failed to resume job from checkpoint

2018-12-07 Thread Stefan Richter
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123) > ... 7 more > 2018-12-06 22:53:41,287 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph- Source: > topic.rate (1/16) (5637f1c3568ca7c29db002e

Re: Failed to resume job from checkpoint

2018-12-07 Thread Stefan Richter
Hi, From what I can see in the log here, it looks like your RocksDB is not recovering from local but from a remote filesystem. This recovery basically has steps: 1: Create a temporary directory (in your example, this is the dir that ends …/5683a26f-cde2-406d-b4cf-3c6c3976f8ba) and download

Re: IntervalJoin is stucked in rocksdb'seek for too long time in flink-1.6.2

2018-11-22 Thread Stefan Richter
Btw how did you make sure that it is stuck in the seek call and that the trace does not show different invocations of seek? This can indicate that seek is slow, but is not yet proof that you are stuck. > On 22. Nov 2018, at 13:01, liujiangang wrote: > > This is not my case. Thank you. > > >

Re: IntervalJoin is stucked in rocksdb'seek for too long time in flink-1.6.2

2018-11-22 Thread Stefan Richter
Hi, are your RocksDB instances running on local SSDs or on something like EBS? I have previously seen cases where this happened because some EBS quota was exhausted and the performance got throttled. Best, Stefan > On 22. Nov 2018, at 09:51, liujiangang wrote: > > Thank you very much. I

Re: ClassNotFoundException: org.apache.kafka.common.metrics.stats.Rate$1

2018-11-20 Thread Stefan Richter
Hi, It should be as easy as making sure that there is a jar with the missing class in the class path of your user-code class loader. Best, Stefan > On 20. Nov 2018, at 14:32, Avi Levi wrote: > > looking at the log file of the taskexecutor I see this exception > at

Re: Tentative release date for 1.6.3

2018-11-20 Thread Stefan Richter
Hi, there is no release date for 1.6.3, yet. Best, Stefan > On 20. Nov 2018, at 12:18, Shailesh Jain wrote: > > Hi, > > Is the tentative release date for 1.6.3 decided? > > Thanks, > Shailesh

Re: FlinkCEP, circular references and checkpointing failures

2018-11-08 Thread Stefan Richter
nd also I could do a quick test in our > environment once the issue is resolved. > > Thanks, > Shailesh > > On Wed, Nov 7, 2018, 10:46 PM Till Rohrmann <mailto:trohrm...@apache.org> wrote: > Really good finding Stefan! > > On Wed, Nov 7, 2018 at 5:28 PM Stefan R

Re: FlinkCEP, circular references and checkpointing failures

2018-11-07 Thread Stefan Richter
Hi, I think I can already spot the problem: LockableTypeSerializer.duplicate() is not properly implemented because it also has to call duplicate() on the element serialiser that is passed into the constructor of the new instance. I will open an issue and fix the problem. Best, Stefan > On 7.
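
A minimal sketch of the kind of bug described: a wrapping serializer whose `duplicate()` must also duplicate the nested element serializer. The class names are stand-ins invented for illustration; this is not Flink's `LockableTypeSerializer` code.

```python
class ElementSerializer:
    """Stand-in for a stateful element serializer that is not thread-safe."""

    def __init__(self):
        self.scratch = bytearray()  # mutable per-instance buffer

    def duplicate(self):
        return ElementSerializer()


class WrappingSerializer:
    """Stand-in for a serializer that wraps an element serializer."""

    def __init__(self, element_serializer):
        self.element_serializer = element_serializer

    def duplicate_shallow(self):
        # Problematic variant: new wrapper, but the nested serializer is shared.
        return WrappingSerializer(self.element_serializer)

    def duplicate(self):
        # Correct variant: the nested serializer is duplicated as well.
        return WrappingSerializer(self.element_serializer.duplicate())


original = WrappingSerializer(ElementSerializer())
shallow = original.duplicate_shallow()
deep = original.duplicate()
```

With the shallow variant, two threads that each hold a "duplicate" still write into the same nested buffer, which is the kind of sharing `duplicate()` is meant to rule out.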

Re: Error restoring from savepoint while there's no modification to the job

2018-10-15 Thread Stefan Richter
Hi, I see, then the important question for me is if the problem exists on the release/master code or just on your branches. Of course we can hardly give any advice for custom builds and without any code. In general, you should debug in HeapKeyedStateBackend lines 774-776 (the write part)

Re: Error restoring from savepoint while there's no modification to the job

2018-10-15 Thread Stefan Richter
Hi, I think it is rather unlikely that this is the problem because it should give a different kind of exception. Would it be possible to provide a minimal and self-contained example code for a problematic job? Best, Stefan > On 15. Oct 2018, at 08:29, Averell wrote: > > Hi everyone, > >

Re: Flink leaves a lot RocksDB sst files in tmp directory

2018-10-12 Thread Stefan Richter
Hi, Can you maybe show us what is inside one of the instance directories? Furthermore, your TM logs show multiple instances of OutOfMemoryErrors, so that might also be a problem. Also how was the job moved? If a TM is killed, of course it cannot clean up. That is why the data goes to tmp dir

Re: Large rocksdb state restore/checkpoint duration behavior

2018-10-10 Thread Stefan Richter
Hi, I would assume that the problem about blocked processing during a checkpoint is caused by [1], because you mentioned the use of RocksDB incremental checkpoints and it could be that you use it in combination with heap-based timers. This is the one combination that currently still uses a

Re: Error restoring from savepoint while there's no modification to the job

2018-10-10 Thread Stefan Richter
Hi, adding to Dawid's questions, it would also be very helpful to know which Flink version was used to create the savepoint, which Flink version was used in the restore attempt, and whether the savepoint was moved or modified. Outside of potential conflicts with those things, I would not expect anything

Re: ***UNCHECKED*** Error while confirming Checkpoint

2018-10-08 Thread Stefan Richter
Hi Pedro, unfortunately the interesting parts are all removed from the log, we already know about the exception itself. In particular, what I would like to see is what checkpoints have been triggered and completed before the exception happens. Best, Stefan > On 08.10.2018, at 10:23,

Re: Data loss when restoring from savepoint

2018-10-04 Thread Stefan Richter
ht be a factor. How would you assume that backpressure would influence your updates? Updates to each local state still happen event-by-event, in a single reader/writing thread. > > On Thu, Oct 4, 2018 at 4:18 PM Stefan Richter <mailto:s.rich...@data-artisans.com>> wrote: > Hi

Re: Data loss when restoring from savepoint

2018-10-04 Thread Stefan Richter
Hi, you could take a look at Bravo [1] to query your savepoints and to check if the state in the savepoint is complete w.r.t. your expectations. I somewhat doubt that there is a general problem with the state/savepoints because many users are successfully running it on a large state and I am not

Re: Rocksdb Metrics

2018-09-25 Thread Stefan Richter
Hi, this feature is tracked here https://issues.apache.org/jira/browse/FLINK-10423 Best, Stefan > On 25.09.2018, at 17:51, Sayat Satybaldiyev wrote: > > Flink provides a rich number of metrics. However, I didn't find any metrics > for

Re: ArrayIndexOutOfBoundsException

2018-09-25 Thread Stefan Richter
der Smirnov > : > > Thanks Stefan. > > is it only Flink runtime should be updated, or the job should be recompiled > too? > Is there a workaround to start the job without upgrading Flink? > > Alex > > On Tue, Sep 25, 2018 at 5:48 PM Stefan Richter <mailto:s.

Re: ArrayIndexOutOfBoundsException

2018-09-25 Thread Stefan Richter
Hi, this problem looks like https://issues.apache.org/jira/browse/FLINK-8836 which would also match your Flink version. I suggest updating to 1.4.3 or higher to avoid the issue in the future. Best, Stefan > On 25.09.2018, at 16:37,

Re: ***UNCHECKED*** Error while confirming Checkpoint

2018-09-25 Thread Stefan Richter
Hi, I cannot spot anything bad or „wrong“ about your job configuration. Maybe you can try to save and send the logs if it happens again? Did you observe this only once, often, or is it something that is even reproducible? Best, Stefan > On 24.09.2018, at 10:15, PedroMrChaves wrote: > >

Re: ***UNCHECKED*** Error while confirming Checkpoint

2018-09-21 Thread Stefan Richter
Hi, could you provide some logs for this problematic job because I would like to double check the reason why this violated precondition did actually happen? Thanks, Stefan > On 20.09.2018, at 17:24, Stefan Richter wrote: > > FYI, here is a link to my PR: https://github.com/apache/f

Re: Flink can't initialize operator state backend when starting from checkpoint

2018-09-21 Thread Stefan Richter
Hi, that is correct. If you are using custom serializers you should double check their correctness, maybe using our test base for type serializers. Another reason could be that you modified the job in a way that silently changed the schema somehow. Concurrent use of serializers across

Re: multiple flink applications on yarn are shown in one application.

2018-09-21 Thread Stefan Richter
> I browse the yarn application. As the picture shows I got 2 applications(0013 > / 0012) but the job in both applications is the same. I can’t find the job > submitted secondly. The job in application_XXX_0013 should be > rpt_company_user_s. This will not happen in session mode.

Re: Flink takes 40ms ~ 100ms to proceed from one operator to another?

2018-09-20 Thread Stefan Richter
Oh yes exactly, enable is right. > On 20.09.2018, at 17:48, Hequn Cheng wrote: > > Hi Stefan, > > Do you mean enable object reuse? > If you want to reduce latency between chained operators, you can also try to > disable object-reuse: > > On Thu, Sep 20, 2018

Re: ***UNCHECKED*** Error while confirming Checkpoint

2018-09-20 Thread Stefan Richter
FYI, here is a link to my PR: https://github.com/apache/flink/pull/6723 > On 20.09.2018, at 14:52, Stefan Richter wrote: > > Hi, > > I think the failing precondition is too strict because sometimes a checkpoint > can overtake another checkpoint and in that case the commit is a

Re: Unit/Integration test for stateful source

2018-09-20 Thread Stefan Richter
Hi, maybe you can use AbstractStreamOperatorTestHarness to test your source, including the snapshotting. You can take a look at the tests of some other source, e.g. StatefulSequenceSourceTest#testCheckpointRestore. Best, Stefan > On 20.09.2018, at 15:29, Darshan Singh wrote: > > Hi, > > I

Re: Flink takes 40ms ~ 100ms to proceed from one operator to another?

2018-09-20 Thread Stefan Richter
Sorry, forgot the link for reference [1], which is https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/datastream_api.html#controlling-latency > On 20.09.2018, at 16:36, Stefan Richter wrote: > > Hi, > > you did not provide much information for this question, e.

Re: Flink takes 40ms ~ 100ms to proceed from one operator to another?

2018-09-20 Thread Stefan Richter
Hi, you did not provide much information for this question, e.g. what exactly you measure and how, whether this is a local or distributed setting, etc. I assume that it is distributed and that the cause for your observation is the buffer timeout, i.e. the maximum time that Flink waits until

Re: ***UNCHECKED*** Error while confirming Checkpoint

2018-09-20 Thread Stefan Richter
Hi, I think the failing precondition is too strict because sometimes a checkpoint can overtake another checkpoint and in that case the commit is already subsumed. I will open a Jira and PR with a fix. Best, Stefan > On 19.09.2018, at 10:04, PedroMrChaves wrote: > > Hello, > > I have a
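
The "already subsumed" case can be sketched as follows. This is an illustration of the idea only, not the code from the PR: the `CommitTracker` class and its methods are hypothetical, and it assumes that completing checkpoint N also commits everything pending for earlier checkpoints.

```python
class CommitTracker:
    """Hypothetical tracker for data that is committed on checkpoint completion."""

    def __init__(self):
        self.pending = {}    # checkpoint id -> data waiting for commit
        self.committed = []  # data committed so far, in commit order

    def snapshot(self, checkpoint_id, data):
        self.pending[checkpoint_id] = data

    def notify_complete(self, checkpoint_id):
        """Commit everything up to checkpoint_id and tolerate late notifications."""
        ready = sorted(cid for cid in self.pending if cid <= checkpoint_id)
        for cid in ready:
            self.committed.append(self.pending.pop(cid))
        # False means nothing was left to commit: the checkpoint was subsumed.
        return bool(ready)


tracker = CommitTracker()
tracker.snapshot(1, "a")
tracker.snapshot(2, "b")
first = tracker.notify_complete(2)  # checkpoint 2 overtakes checkpoint 1
late = tracker.notify_complete(1)   # late notification for checkpoint 1
```

A strict precondition would raise on the second call because nothing is pending for checkpoint 1; treating it as a no-op is the more lenient behaviour described above.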

Re: Trying to figure out why a slot takes a long time to checkpoint

2018-09-20 Thread Stefan Richter
A debug log for state backend and checkpoint coordinator could also help. > On 20.09.2018, at 14:19, Stefan Richter wrote: > > Hi, > > if some tasks take like 50 minutes, could you wait until such a checkpoint is > in progress and (let’s say after 10 minutes) log into t

Re: Trying to figure out why a slot takes a long time to checkpoint

2018-09-20 Thread Stefan Richter
Hi, if some tasks take like 50 minutes, could you wait until such a checkpoint is in progress and (let’s say after 10 minutes) log into the node and create a (or multiple over time) thread-dump(s) for the JVM that runs the slow checkpointing task. This could help to figure out where it is

Re: How to get the location of keytab when using flink on yarn

2018-09-20 Thread Stefan Richter
Hi, maybe Aljoscha or Eron (both in CC) can help you with this problem, I think they might know best about the Kerberos security. Best, Stefan > On 20.09.2018, at 11:20, 杨光 wrote: > > Hi, > i am using the " per-job YARN session " mode deploy flink job on yarn and my > flink > version is

Re: Writer has already been opened on BucketingSink to S3

2018-09-20 Thread Stefan Richter
Hi, thanks for putting some effort into debugging the problem. Could you open a Jira with the problem and your analysis so that we can discuss how to proceed with it? Best, Stefan > On 18.09.2018, at 23:16, Chengzhi Zhao wrote: > > After checking the code, I think the issue might be related

Re: S3 connector Hadoop class mismatch

2018-09-20 Thread Stefan Richter
Hi, I could not find any open Jira for the problem you describe. Could you please open one? Best, Stefan > On 19.09.2018, at 09:54, Paul Lam wrote: > > Hi, > > I’m using StreamingFileSink of 1.6.0 to write logs to S3 and encounter a > classloader problem. It seems that there are conflicts

Re: multiple flink applications on yarn are shown in one application.

2018-09-20 Thread Stefan Richter
Hi, currently, Flink still has to use session mode under the hood if you submit the job in attached-mode. The reason is that the job could consist of multiple parts that need to run one after the other. This will be changed in the future and also should not happen if you submit the job

Re: Checkpointing not working

2018-09-20 Thread Stefan Richter
Hi, in the absence of any logs, my guess would be that your checkpoints are just not able to complete within 10 seconds, the state might be too large or the network and fs too slow. Are you using full or incremental checkpoints? For your relatively small interval, I suggest that you try using

Re: Why FlinkKafkaConsumer08 does not support exactly once or at least once semantic?

2018-09-20 Thread Stefan Richter
Hi, I think this part of the documentation is talking about KafkaProducer, and you are reading in the source code of KafkaConsumer. Best, Stefan > On 20.09.2018, at 10:48, 徐涛 wrote: > > Hi All, > In document of Flink 1.6, it says that "Before 0.9 Kafka did not > provide any mechanisms

Re: Flink 1.5.2 process keeps reference to deleted blob files.

2018-09-20 Thread Stefan Richter
Hi, I think it would be very helpful if you could identify what data is behind. For example, I could imagine that it can be a jar file that was used by the TM and some classes are still in use or loaded by a classloader that was not yet GCed. Depending on that, there could be a problem in the

Re: How to use checkpoint in flink1.5.3

2018-09-20 Thread Stefan Richter
Hi, did you introduce some custom modifications to the code? Your stack trace does not match the lines in the code of release-1.5.3, e.g. line 230 is not in method internalTimeServiceManager(…) which makes it hard to draw any conclusions. Best, Stefan > On 19.09.2018, at 14:03,

Re: Trying to figure out why a slot takes a long time to checkpoint

2018-09-18 Thread Stefan Richter
Adding to my previous email, I start to doubt the explanation a little bit because the alignment times are also very low. Could it be possible that it takes very long for the checkpoint operation (for whatever reason) to get the checkpointing lock? > On 18.09.2018, at 11:58, Ste

Re: Trying to figure out why a slot takes a long time to checkpoint

2018-09-18 Thread Stefan Richter
Hi, from your screenshot, it looks like everything is running fine as soon as the snapshots are actually running, sync and async part times are normal. So I think the explanation is the time that the checkpoint barrier needs to reach this particular operator. It seems like there is a large

Re: Flink 1.3.2 RocksDB map segment fail if configured as state backend

2018-09-17 Thread Stefan Richter
Hi, I think the exception pretty much says what is wrong, the native library cannot be mapped into the process because of some access rights problem. Please make sure that your path /tmp has the exec right. Best, Stefan > On 17.09.2018, at 11:37, Andrea Spina wrote: > > Hi everybody, > > I

Re: After OutOfMemoryError State can not be readed

2018-09-07 Thread Stefan Richter
Hi, what I can say is that any failures like OOMs should not corrupt checkpoint files, because only successfully completed checkpoints are used for recovery by the job manager. Just to get a bit more info, are you using full or incremental checkpoints? Unfortunately, it is a bit hard to say

Re: Increased Size of Incremental Checkpoint

2018-09-06 Thread Stefan Richter
Hi, you should expect that the size can vary for some checkpoints, even if the change rate is constant. Some checkpoints will upload compacted replacements for previous checkpoints to prevent the checkpoint history from growing without bounds. Whenever that happens, the checkpoint does some

Re: Error while trigger checkpoint due to Kyro Exception

2018-08-19 Thread Stefan Richter
Hi, this problem is fixed in Flink >= 1.4.3, see https://issues.apache.org/jira/browse/FLINK-8836 . Best, Stefan > On 18.08.2018, at 05:07, Bruce Qiu wrote: > > Hi Community, > I am using Flink 1.4.2 to do streaming processing. I fetch data

Re: Need a clarification about removing a stateful operator

2018-08-17 Thread Stefan Richter
Hi, it will not be transported. The JM does the state assignment to create the deployment information for all tasks. It will just exclude the state for operators that are not present. So in your next checkpoints they will no longer be contained. Best, Stefan > On 17.08.2018, at 09:26,
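
A minimal sketch of the assignment step described here, assuming state is keyed by operator ID. The function, the operator names, and the data shapes are hypothetical and only illustrate the filtering, not the JobManager's real data structures.

```python
def assign_state(checkpoint_state, job_operators):
    """Keep only the state of operators that still exist in the new job graph."""
    return {
        operator: state
        for operator, state in checkpoint_state.items()
        if operator in job_operators
    }

# Hypothetical operator IDs and state handles.
checkpoint_state = {"source": "s0", "removed-map": "m0", "sink": "k0"}
assigned = assign_state(checkpoint_state, job_operators={"source", "sink"})
```

Because the dropped state never reaches a task, later checkpoints are built only from the assigned state and no longer contain the removed operator.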

Re: Window state with rocksdb backend

2018-08-09 Thread Stefan Richter
Hi, it is not quite clear to me what your window function is doing, so sharing some (pseudo) code would be helpful. Is it ever calling an update function for the state you are trying to modify? From the information I have it seems not to be the case and that is a wrong use of the API which

Re: Permissions to delete Checkpoint on cancel

2018-07-23 Thread Stefan Richter
n of security complainace to climb before we > will in Production if we need to go that route. > > Thanks, Ashish > > On Monday, July 23, 2018, 5:10 AM, Stefan Richter > wrote: > > Hi, > > I am wondering how this can even work properly if you are using a local fs &g

Re: Flink job hangs using rocksDb as backend

2018-07-23 Thread Stefan Richter
fig().enableExternalizedCheckpoints( > CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION); > env.getCheckpointConfig().setMaxConcurrentCheckpoints(1); > > > The last thing I am intended to try is using FSSatebackend to be sure if its > rocksDB related issue, but the problem

Re: Permissions to delete Checkpoint on cancel

2018-07-23 Thread Stefan Richter
Hi, I am wondering how this can even work properly if you are using a local fs for checkpoints instead of a distributed fs. First, what happens under node failures, if the SSD becomes unavailable or if a task gets scheduled to a different machine, and can no longer access the disk with the

Re: Memory Leak in ProcessingTimeSessionWindow

2018-07-23 Thread Stefan Richter
Hi, for most windows, all state is cleared through FIRE_AND_PURGE, except for windows that are subtypes of merging windows, such as session windows. Here, the state still remembers the window itself until the watermark passes the session timeout+allowed lateness. This is done so that elements
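
The retention rule for session windows can be sketched as a simple time comparison. This is a simplification under stated assumptions: the function names are hypothetical, times are plain integers, and the exact boundary handling in Flink's window operator may differ by one timestamp unit.

```python
def session_end(last_event_time: int, session_gap: int) -> int:
    """End of a session window: last event time plus the session timeout."""
    return last_event_time + session_gap

def can_drop_window(window_end: int, allowed_lateness: int, watermark: int) -> bool:
    """The window must be remembered until the watermark passes end + lateness."""
    return watermark >= window_end + allowed_lateness

end = session_end(last_event_time=90, session_gap=10)
still_needed = not can_drop_window(end, allowed_lateness=10, watermark=105)
droppable = can_drop_window(end, allowed_lateness=10, watermark=110)
```

Until that point a late element could still extend or merge the session, so a firing with purge alone is not enough to release the window's bookkeeping.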

Re: Events can overtake watermarks

2018-07-23 Thread Stefan Richter
Hi, events overtaking watermarks doesn’t sound like a „wrong“ behaviour, only watermarks overtaking events would be bad. Do you think this only started with Flink 1.5? To me this does not sound like a problem, but I am not sure if it is intended. Looping in Aljoscha, just in case. Best, Stefan >

Re: Flink job hangs using rocksDb as backend

2018-07-23 Thread Stefan Richter
e.inputWatermark(StatusWatermarkValve.java:111) > at > org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:189) > at > org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69) > at > org.apache.flink.streaming.runtim

Re: Flink job hangs using rocksDb as backend

2018-07-12 Thread Stefan Richter
apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping > response for sessionid: 0x2648355f7c6010f after 10ms > 2018-07-12 14:36:11,957 DEBUG > org.apache.flink.runtime.taskmanager.TaskManager - Sending > heartbeat to JobManager > > As I expected checkpoint also st

Re: help understand/debug high memory footprint on jobmanager

2018-06-29 Thread Stefan Richter
Hi Steven, from your analysis, I would conclude the following problem. ExecutionVertexes hold executions, which are bootstrapped with the state (in form of the map of state handles) when the job is initialized from a checkpoint/savepoint. It holds a reference on this state, even when the task

Re: Storing Streaming Data into Static source

2018-06-26 Thread Stefan Richter
Hi, I think this is not really a Flink related question. In any case, you might want to specify a bit more what you mean by „better", because usually there is no strict better but trade-offs and what is „better“ to somebody might not be „better“ for you. Best, Stefan > On 26.06.2018, at 12:54

Re: String Interning

2018-06-26 Thread Stefan Richter
Hi, you can enable object reuse via the execution config [1]: „By default, objects are not reused in Flink. Enabling the object reuse mode will instruct the runtime to reuse user objects for better performance. Keep in mind that this can lead to bugs when the user-code function of an operation
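
The kind of bug that object reuse can cause is easy to reproduce outside Flink. The sketch below is a plain illustration of the hazard, not Flink code: `Record` and both reader functions are hypothetical, and the "user code" is simply a list that keeps references to the records it receives.

```python
class Record:
    """Hypothetical mutable record passed between operators."""

    def __init__(self):
        self.value = None


def read_with_reuse(values):
    """Yield the same Record instance for every element (object reuse)."""
    record = Record()
    for value in values:
        record.value = value
        yield record


def read_without_reuse(values):
    """Yield a fresh Record instance for every element."""
    for value in values:
        record = Record()
        record.value = value
        yield record


# User code that buffers references instead of copying the records.
kept_reused = list(read_with_reuse([1, 2, 3]))
kept_fresh = list(read_without_reuse([1, 2, 3]))

reused_view = [record.value for record in kept_reused]
fresh_view = [record.value for record in kept_fresh]
```

With reuse, every buffered reference points at the same object, so the earlier values are overwritten. Code that holds on to input objects needs to copy them before object reuse is safe to enable.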

Re: Memory Leak in ProcessingTimeSessionWindow

2018-06-20 Thread Stefan Richter
ers anyone might have. > > Thanks, Ashish > > On Monday, June 18, 2018, 12:54:03 PM EDT, ashish pok > wrote: > > > Right, that's where I am headed now but was wondering if there are any “gotchas” I > am missing before I try and dig into a few gigs of heap dump. > > >

Re: Memory Leak in ProcessingTimeSessionWindow

2018-06-18 Thread Stefan Richter
Hi, can you take a heap dump from a JVM that runs into the problem and share it with us? That would make finding the cause a lot easier. Best, Stefan > On 15.06.2018 at 23:01, ashish pok wrote: > > All, > > I have another slow Memory Leak situation using basic TimeSession Window >

Re: Clarity on Flink 1.5 Rescale mechanism

2018-06-12 Thread Stefan Richter
Hi, it means that you can now modify the parallelism of a running job with a new „modify“ command in the CLI, see [1]. Adding task managers will add their offered slots to the pool of available slots; it will not automatically change the parallelism. Best, Stefan [1]

Re: Stopping of a streaming job empties state store on HDFS

2018-06-11 Thread Stefan Richter
Hi, > On 08.06.2018 at 01:16, Peter Zende wrote: > > Hi all, > > We have a streaming pipeline (Flink 1.4.2) for which we implemented stoppable > sources to be able to gracefully exit from the job with Yarn state > "finished/succeeded". > This works fine, however after creating a savepoint,

Re: Having a backoff while experiencing checkpointing failures

2018-06-11 Thread Stefan Richter
Hi, I think the behaviour of min_pause_between_checkpoints is either buggy or we should at least discuss whether it would be better to respect a pause also for failed checkpoints. As far as I know there is no ongoing work to add backoff, so I suggest you open a Jira issue and make a case for

Re: Kryo Exception

2018-05-25 Thread Stefan Richter
I agree, it looks like one of the two mentioned issues. > On 25.05.2018 at 06:15, sihua zhou wrote: > > Hi Gordon, > > I think this might not be caused by > https://issues.apache.org/jira/browse/FLINK-9263 > , the bug in

Re: Missing MapState when Timer fires after restored state

2018-05-18 Thread Stefan Richter
effects. > > As for the bug I faced, indeed I was able to reproduce it consistently. And I > have provided TRACE-level logs personally to Stefan. If there is no Jira > ticket for this yet, would you like me to create one? > > On Thu, May 17, 2018 at 1:00 PM, Stefan Richter <

Re: are there any ways to test the performance of rocksdb state backend?

2018-05-18 Thread Stefan Richter
Hi, Sihua is right. Of course we would like to update our RocksDB version, but we are currently blocked on a performance regression. Here is our issue in the RocksDB tracker for this: https://github.com/facebook/rocksdb/issues/3865 . Best, Stefan

Re: Missing MapState when Timer fires after restored state

2018-05-16 Thread Stefan Richter
se but INFO > level logs: > > Maybe I need to edit the log4j.properties instead(?). Indeed it's Flink >

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Stefan Richter
t simply exceed the limit. If you have an overview of how many parallel operator instances with keyed state were running on the machine and assume some reasonable number of files per RocksDB instance and the limit configured in your OS, could that be the case? > > Thanks! Thanks for you

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Stefan Richter
open files“ problem only happen with local recovery (asking since it should actually not add to the amount of open files), and did you deactivate it on the second cluster for the restart or change your OS settings? > On 15.05.2018 at 10:09, Stefan Richter <s.rich...@data-artisa

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Stefan Richter
e tweaking, because I > don't want to tangle with the same kafka consumer group offsets or send old > data again to production endpoint. > > Please keep in mind that there was that Too Many Open Files issue on the > cluster that created the problematic checkpoint, if you think that's relevant. >

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Stefan Richter
Hi, I agree, this looks like a bug. Can you tell us your exact configuration of the state backend, e.g. if you are using incremental checkpoints or not. Are you using the local recovery feature? Are you restarting the job from a checkpoint or a savepoint? Can you provide logs for both the job
