No checkpoints are active. I will try that state backend. Yes, we are using a JSONObject subclass for most of the intermediate state, with JSON strings in and out of Kafka. I will look at the configuration page for how to enable object reuse.
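For my own notes, here is roughly what I understand those two changes to look like when setting up the job (a minimal sketch; the checkpoint path is a placeholder for our setup, and object reuse is only safe if our functions do not mutate or hold on to input records):

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class JobSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Swap RocksDB for the heap-based FsStateBackend: state stays on
            // the JVM heap; only checkpoints (if enabled) go to this path.
            env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints"));

            // Skip defensive copies between chained operators. Safe only when
            // user functions neither mutate nor retain the records they receive.
            env.getConfig().enableObjectReuse();

            // ... build the pipeline and call env.execute() as usual.
        }
    }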
Thank you,
Michael

> On Apr 17, 2018, at 12:51 PM, Stephan Ewen <se...@apache.org> wrote:
>
> A few ideas for how to start debugging this:
>
> - Try deactivating checkpoints. Without them, no work goes into persisting
>   RocksDB data to the checkpoint store.
> - Try to swap RocksDB for the FsStateBackend - that reduces serialization
>   cost for moving data between heap and off-heap (RocksDB).
> - Do you have some expensive types (JSON, etc.)? Try activating object
>   reuse (which avoids some extra defensive copies).
>
> On Tue, Apr 17, 2018 at 5:50 PM, TechnoMage <mla...@technomage.com> wrote:
>
> Memory use is steady throughout the job, but the CPU utilization drops off
> a cliff. I assume this is because it becomes I/O bound shuffling managed
> state.
>
> Are there any metrics on managed state that can help in evaluating what to
> do next?
>
> Michael
>
>> On Apr 17, 2018, at 7:11 AM, Michael Latta <mla...@technomage.com> wrote:
>>
>> Thanks for the suggestion. The task manager is configured for 8 GB of
>> heap and gets to about 8.3 GB total; other Java processes (the job
>> manager and Kafka) add a few more. I will check it again, but the
>> instances have 16 GB, the same as my laptop, which completes the test in
>> under 90 minutes.
>>
>> Michael
>>
>> Sent from my iPad
>>
>> On Apr 16, 2018, at 10:53 PM, Niclas Hedhman <nic...@hedhman.org> wrote:
>>
>>> Have you checked memory usage? It could be as simple as having memory
>>> leaks, or aggregating more than you think (it is sometimes not obvious
>>> how much is kept in memory for longer than one first expects). If
>>> possible, connect FlightRecorder or a similar tool and keep an eye on
>>> memory. Additionally, I don't have AWS experience to speak of, but IF
>>> AWS swaps RAM to disk like regular Linux, that might be triggered if
>>> your JVM heap is bigger than the available RAM can handle.
>>>
>>> On Tue, Apr 17, 2018 at 9:26 AM, TechnoMage <mla...@technomage.com> wrote:
>>>
>>> I am doing a short proof of concept for using Flink and Kafka in our
>>> product. On my laptop I can process 10M inputs in about 90 min. On two
>>> different EC2 instances (m4.xlarge and m5.xlarge, both 4 cores, 16 GB
>>> RAM, and SSD storage) I see the process hit a wall around 50 min into
>>> the test, short of 7M events processed. In all cases this is running
>>> ZooKeeper, the Kafka broker, and Flink on the same server. My goal is
>>> to measure single-node vs. multi-node performance and test horizontal
>>> scalability, but I would like to figure out why it hits a wall first. I
>>> have the task manager configured with 6 slots and the job has a
>>> parallelism of 5. The laptop has 8 threads, and the EC2 instances have
>>> 4 threads. On smaller data sets, and at the beginning of each test, the
>>> EC2 instances outpace the laptop. I will try again with an m5.2xlarge,
>>> which has 8 threads and 32 GB RAM, to see if that works better for this
>>> workload. Any pointers or ways to get metrics that would help diagnose
>>> this would be appreciated.
>>>
>>> Michael
>>>
>>> --
>>> Niclas Hedhman, Software Developer
>>> http://polygene.apache.org - New Energy for Java
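PS: On the managed-state metrics question above, one thing I may try is
registering a custom gauge so each parallel subtask reports how many
entries it has written into its keyed state. A rough sketch (class, state,
and metric names are made up for illustration; the counter is per subtask,
counts puts rather than distinct keys, and resets on restart):

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.MapState;
    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.metrics.Gauge;
    import org.apache.flink.util.Collector;

    // Must run on a keyed stream, since it uses keyed MapState.
    public class TrackedAggregator extends RichFlatMapFunction<String, String> {
        private transient MapState<String, String> pending;
        private transient long entryCount; // approximate, per subtask

        @Override
        public void open(Configuration parameters) throws Exception {
            pending = getRuntimeContext().getMapState(
                    new MapStateDescriptor<>("pending", String.class, String.class));
            // Expose the counter through Flink's metrics system so it shows
            // up in the web UI / reporters alongside the built-in metrics.
            getRuntimeContext().getMetricGroup()
                    .gauge("pendingEntries", (Gauge<Long>) () -> entryCount);
        }

        @Override
        public void flatMap(String value, Collector<String> out) throws Exception {
            pending.put(value, value); // illustrative; real logic goes here
            entryCount++;
        }
    }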