Hi! I cannot find the screenshots you attached. The Apache mailing lists sometimes don't support attachments. Can you link to the screenshots some other way?
Stephan

On Mon, Feb 20, 2017 at 8:36 PM, vinay patil <vinay18.pa...@gmail.com> wrote:

> Hi Stephan,
>
> I just saw your mail while I was writing up the answers to your earlier
> questions. I have attached some more screenshots, taken from the latest
> run today.
>
> Yes, I will try setting it to a higher value and check whether
> performance improves.
>
> Let me know your thoughts.
>
> Regards,
> Vinay Patil
>
> On Tue, Feb 21, 2017 at 12:51 AM, Stephan Ewen [via Apache Flink User
> Mailing List archive.] <[hidden email]> wrote:
>
>> @Vinay!
>>
>> Just saw the screenshot you attached to the first mail. The checkpoint
>> that failed came after one that had an incredibly heavy alignment phase
>> (14 GB). I think that working that off is what broke the next
>> checkpoint, because the workers were still working through the
>> alignment backlog.
>>
>> I think you can fix this for now by setting the minimum pause between
>> checkpoints a bit higher (it is probably set a bit too low for the
>> state of your application).
>>
>> Also, can you describe what your sources are (Kafka / Kinesis or file
>> system)?
>>
>> BTW: We are currently working on
>>   - incremental RocksDB checkpoints
>>   - the network stack, to allow a new way of doing the alignment in
>>     the future
>>
>> Both should help make the program more resilient to these situations.
>>
>> Best,
>> Stephan
>>
>> On Mon, Feb 20, 2017 at 7:51 PM, Stephan Ewen <[hidden email]> wrote:
>>
>>> Hi Vinay!
>>>
>>> Can you start by giving us a bit of an environment spec?
>>>
>>>   - What Flink version are you using?
>>>   - What is your rough topology (what operations does the program use)?
>>>   - Where is the state (windows, keyBy)?
>>>   - What is the rough size of your checkpoints and where does the time go?
>>>     Can you attach a screenshot from
>>>     https://ci.apache.org/projects/flink/flink-docs-release-1.2/monitoring/checkpoint_monitoring.html
>>>   - What is the size of the JVM?
>>>
>>> Those things would be helpful to know...
>>>
>>> Best,
>>> Stephan
>>>
>>> On Mon, Feb 20, 2017 at 7:04 PM, vinay patil <[hidden email]> wrote:
>>>
>>>> Hi Xiaogang,
>>>>
>>>> Thank you for your inputs.
>>>>
>>>> Yes, I have already tried setting MaxBackgroundFlushes and
>>>> MaxBackgroundCompactions to higher values (tried 2, 4, and 8), but I
>>>> am still not getting the expected results.
>>>>
>>>> System.getProperty("java.io.tmpdir") points to /tmp, but I could not
>>>> find the RocksDB logs there. Can you please let me know where I can
>>>> find them?
>>>>
>>>> Regards,
>>>> Vinay Patil
>>>>
>>>> On Mon, Feb 20, 2017 at 7:32 AM, xiaogang.sxg [via Apache Flink User
>>>> Mailing List archive.] <[hidden email]> wrote:
>>>>
>>>>> Hi Vinay,
>>>>>
>>>>> Can you provide the LOG file in RocksDB? It helps a lot in figuring
>>>>> out the problem because it records the options and the events that
>>>>> happened during execution. Unless configured otherwise, it should be
>>>>> located at the path given by System.getProperty("java.io.tmpdir").
>>>>>
>>>>> Typically, a large amount of memory is consumed by RocksDB to store
>>>>> necessary indices. To avoid unlimited growth in memory consumption,
>>>>> you can put these indices into the block cache (set
>>>>> CacheIndexAndFilterBlocks to true) and set the block cache size
>>>>> properly.
>>>>>
>>>>> You can also increase the number of background threads to improve
>>>>> the performance of flushes and compactions (via MaxBackgroundFlushes
>>>>> and MaxBackgroundCompactions).
>>>>>
>>>>> In YARN clusters, task managers will be killed if their memory
>>>>> utilization exceeds the allocation size.
>>>>> Currently Flink does not count the memory used by RocksDB in the
>>>>> allocation. We are working on fine-grained resource allocation (see
>>>>> FLINK-5131). It may help to avoid such problems.
>>>>>
>>>>> I hope this information helps.
>>>>>
>>>>> Regards,
>>>>> Xiaogang
>>>>>
>>>>> ------------------------------------------------------------------
>>>>> From: Vinay Patil <[hidden email]>
>>>>> Sent: Friday, February 17, 2017, 21:19
>>>>> To: user <[hidden email]>
>>>>> Subject: Re: Checkpointing with RocksDB as statebackend
>>>>>
>>>>> Hi Guys,
>>>>>
>>>>> There seems to be some issue with RocksDB memory utilization.
>>>>>
>>>>> Within a few minutes of the job starting, physical memory usage
>>>>> increases by 4-5 GB, and it keeps on increasing.
>>>>> I have tried different options for Max Buffer Size (30 MB, 64 MB,
>>>>> 128 MB, 512 MB) with Min Buffer to Merge set to 2, but physical
>>>>> memory keeps on increasing.
>>>>>
>>>>> According to the RocksDB documentation, these are the main options
>>>>> on which flushing to storage is based.
>>>>>
>>>>> Can you please point out what I am doing wrong? I have tried
>>>>> different configuration options, but each time the Task Manager gets
>>>>> killed after some time :)
>>>>>
>>>>> Regards,
>>>>> Vinay Patil
>>>>>
>>>>> On Thu, Feb 16, 2017 at 6:02 PM, Vinay Patil <[hidden email]> wrote:
>>>>> I think it is more related to RocksDB. I am not familiar with
>>>>> RocksDB either, but I am reading the tuning guide to understand the
>>>>> important values that can be set.
>>>>>
>>>>> Regards,
>>>>> Vinay Patil
>>>>>
>>>>> On Thu, Feb 16, 2017 at 5:48 PM, Stefan Richter [via Apache Flink
>>>>> User Mailing List archive.] <[hidden email]> wrote:
>>>>> What kind of problem are we talking about?
>>>>> S3-related or RocksDB-related? I am not aware of problems with
>>>>> RocksDB per se. I think seeing logs for this would be very helpful.
>>>>>
>>>>> On 16.02.2017 at 11:56, Aljoscha Krettek <[hidden email]> wrote:
>>>>>
>>>>> [hidden email] and [hidden email], could this be the same problem
>>>>> that you recently saw when working with other people?
>>>>>
>>>>> On Wed, 15 Feb 2017 at 17:23 Vinay Patil <[hidden email]> wrote:
>>>>> Hi Guys,
>>>>>
>>>>> Can anyone please help me with this issue?
>>>>>
>>>>> Regards,
>>>>> Vinay Patil
>>>>>
>>>>> On Wed, Feb 15, 2017 at 6:17 PM, Vinay Patil <[hidden email]> wrote:
>>>>> Hi Ted,
>>>>>
>>>>> I have 3 boxes in my pipeline: the 1st and 2nd boxes contain a
>>>>> source and an S3 sink, and the 3rd box is a window operator followed
>>>>> by chained operators and an S3 sink.
>>>>>
>>>>> In the details section I can see that the S3 sink is taking a long
>>>>> time to acknowledge, and it is not even getting to the window
>>>>> operator chain.
>>>>>
>>>>> But as shown in the snapshot, checkpoint id 19 did not get any
>>>>> acknowledgement. I am not sure what is causing the issue.
>>>>>
>>>>> Regards,
>>>>> Vinay Patil
>>>>>
>>>>> On Wed, Feb 15, 2017 at 5:51 PM, Ted Yu [via Apache Flink User
>>>>> Mailing List archive.] <[hidden email]> wrote:
>>>>> What did the More Details link say?
>>>>>
>>>>> Thanks
>>>>>
>>>>> > On Feb 15, 2017, at 3:11 AM, vinay patil <[hidden email]> wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > I have set the checkpointing interval to 6 seconds and the minimum
>>>>> > pause between checkpoints to 5 seconds. While testing the pipeline
>>>>> > I observed that some checkpoints take a long time. As you can see
>>>>> > in the attached snapshot, checkpoint id 19 took the maximum time
>>>>> > before it failed, although it had not received any
>>>>> > acknowledgements. During these 10 minutes the entire pipeline made
>>>>> > no progress and no data was processed. (For example: in 13 minutes
>>>>> > 20M records were processed, and when the checkpoint took time
>>>>> > there was no progress for the next 10 minutes.)
>>>>> >
>>>>> > I have even tried setting the checkpoint timeout to 3 minutes, but
>>>>> > in that case as well multiple checkpoints failed.
>>>>> >
>>>>> > I have set the RocksDB FLASH_SSD_OPTIMIZED option.
>>>>> > What could be the issue?
>>>>> >
>>>>> > P.S. I am writing to 3 S3 sinks.
>>>>> >
>>>>> > checkpointing_issue.PNG
>>>>> > <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/n11640/checkpointing_issue.PNG>
>>>>> >
>>>>> > --
>>>>> > View this message in context:
>>>>> > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpointing-with-RocksDB-as-statebackend-tp11640.html
>>>>> > Sent from the Apache Flink User Mailing List archive. mailing list
>>>>> > archive at Nabble.com.
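To illustrate why a higher minimum pause between checkpoints gives the job more room to recover, here is a minimal sketch of the pacing rule. It is a simplified model, not Flink's actual coordinator logic: it assumes the next checkpoint starts no earlier than one interval after the previous trigger and no earlier than the minimum pause after the previous checkpoint ended. The names CheckpointPacing and nextTriggerMillis, the 20-second checkpoint duration, and the 60-second pause are made up for illustration; only the 6-second interval and 5-second pause come from the thread.

```java
/**
 * Simplified model of checkpoint pacing (an assumption for illustration,
 * not Flink's CheckpointCoordinator implementation).
 */
final class CheckpointPacing {

    /**
     * Earliest time the next checkpoint may start: at least one interval after
     * the previous trigger and at least minPause after the previous one ended.
     */
    static long nextTriggerMillis(long lastTriggerMillis, long lastEndMillis,
                                  long intervalMillis, long minPauseMillis) {
        return Math.max(lastTriggerMillis + intervalMillis,
                        lastEndMillis + minPauseMillis);
    }

    public static void main(String[] args) {
        long interval = 6_000L;           // 6 s interval, as in the thread
        long slowCheckpointEnd = 20_000L; // hypothetical: triggered at t=0, ends at t=20 s

        // 5 s pause (from the thread) versus a hypothetical 60 s pause.
        for (long minPause : new long[] {5_000L, 60_000L}) {
            long next = nextTriggerMillis(0L, slowCheckpointEnd, interval, minPause);
            System.out.println("minPause=" + minPause + " ms -> "
                    + (next - slowCheckpointEnd) + " ms of checkpoint-free processing");
        }
    }
}
```

Under this model, once a checkpoint runs longer than the interval, the minimum pause alone decides how long the job can work through its backlog before the next checkpoint begins.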
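On the question of where the RocksDB LOG file ends up: a minimal sketch for searching a directory tree for files named LOG. It assumes the info log keeps RocksDB's default file name LOG and that it lives somewhere below the directory passed in (by default java.io.tmpdir, as suggested in the thread). On a real cluster the RocksDB working directory may be configured elsewhere, which this sketch cannot know. RocksDbLogFinder and findLogFiles are illustrative names, not part of Flink or RocksDB.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative helper: lists regular files named "LOG" below a root directory. */
final class RocksDbLogFinder {

    static List<Path> findLogFiles(Path root) throws IOException {
        List<Path> found = new ArrayList<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                Path name = file.getFileName();
                if (attrs.isRegularFile() && name != null && name.toString().equals("LOG")) {
                    found.add(file);
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Skip entries that cannot be read (common under a shared /tmp).
                return FileVisitResult.CONTINUE;
            }
        });
        Collections.sort(found);
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Default to java.io.tmpdir; pass another directory as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        for (Path log : findLogFiles(root)) {
            System.out.println(log);
        }
    }
}
```

If nothing turns up under java.io.tmpdir, passing the task manager's configured temporary or local directories as the argument is the next thing to try.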