First link without frame: https://ci.apache.org/projects/apex-malhar/apex-malhar-javadoc-release-3.6/org/apache/apex/malhar/lib/state/managed/AbstractManagedStateImpl.html
On Mon, Jan 23, 2017 at 3:33 PM, Thomas Weise <t...@apache.org> wrote:
> Roger,
>
> An Apex operator typically holds state that it uses for processing and
> often that state is mutable. For large state: "Managed state" in
> Malhar (and its predecessor HDHT) was designed for large state that
> can be mutated efficiently under a specific write pattern (semi
> ordered keys). However, there is no benefit in using these for
> immutable data that is already in HDFS.
>
> In such a case it would be best to store the data (during
> migration/ingest) in HDFS in a file format that allows for fast random
> reads (block structured files like HFile or TFile or any other indexed
> structure provide that).
>
> Also, depending on how the data, once in memory, would be used, an
> Apex operator may or may not be the right home. If the goal is to only
> look up data without further processing with a synchronous
> request/response pattern, then an IMDG or similar system may be a more
> appropriate solution.
>
> Here are pointers for managed state:
>
> https://ci.apache.org/projects/apex-malhar/apex-malhar-javadoc-release-3.6/index.html
> https://github.com/apache/apex-malhar/blob/master/benchmark/src/main/java/com/datatorrent/benchmark/state/ManagedStateBenchmarkApp.java
>
> Thanks,
> Thomas
>
>
> On Sun, Jan 22, 2017 at 11:43 PM, Ashwin Chandra Putta
> <ashwinchand...@gmail.com> wrote:
>> Roger,
>>
>> Depending on the requirements for expected latency, size of data, etc.,
>> the operator's design will change.
>>
>> If latency needs to be the lowest possible, meaning completely in-memory
>> and not hitting the disk for read I/O, there are two scenarios:
>> 1. If the lookup data size is small --> just load it into memory in the
>> setup call, and switch off checkpointing to get rid of checkpoint I/O
>> latency in between. In case of operator restarts, the data should be
>> reloaded in setup.
>> 2. If the lookup data is large --> have many partitions of this operator to
>> minimize the footprint of each partition. Still switch off checkpointing and
>> reload in setup in case of operator restart. Having many partitions will
>> ensure that the setup load is fast. The incoming query needs to be
>> partitioned based on the lookup key.
>>
>> You can use the POJOEnricher with FSLoader for the above design.
>>
>> Code:
>> https://github.com/apache/apex-malhar/blob/master/contrib/src/main/java/com/datatorrent/contrib/enrich/POJOEnricher.java
>> Example:
>> https://github.com/DataTorrent/examples/tree/master/tutorials/enricher
>>
>> In case of a large lookup dataset where latency caused by disk read I/O is
>> fine, use HDHT or managed state as a backup mechanism for the in-memory data
>> to decrease the checkpoint footprint. I could not find an example for managed
>> state, but here are the links for HDHT.
>>
>> Code:
>> https://github.com/DataTorrent/Megh/tree/master/contrib/src/main/java/com/datatorrent/contrib/hdht
>> Example:
>> https://github.com/DataTorrent/examples/blob/master/tutorials/hdht/src/test/java/com/example/HDHTAppTest.java
>>
>> Regards,
>> Ashwin.
>>
>> On Sun, Jan 22, 2017 at 10:45 PM, Sanjay Pujare <san...@datatorrent.com>
>> wrote:
>>>
>>> You may want to take a look at com.datatorrent.lib.fileaccess.DTFileReader
>>> in the malhar-library – not sure whether it supports reading the whole file
>>> into memory.
>>>
>>> Also there is a library called Megh at https://github.com/DataTorrent/Megh
>>> where you might find some useful operators like
>>> com.datatorrent.contrib.hdht.hfile.HFileImpl .
>>>
>>> From: Roger F <rf301...@gmail.com>
>>> Reply-To: <users@apex.apache.org>
>>> Date: Sunday, January 22, 2017 at 9:32 PM
>>> To: <users@apex.apache.org>
>>> Subject: One-time Initialization of in-memory data using a data file
>>>
>>> Hi,
>>>
>>> I have a use case where application business data needs to be migrated
>>> from a legacy system (such as a mainframe) into HDFS and then loaded for
>>> use by an Apex application.
>>>
>>> To get this done, an approach being considered is to perform a one-time
>>> initialization of the data from HDFS into application memory. This data
>>> will then be queried for various business logic functions of the
>>> application.
>>>
>>> Once the data is loaded, this operator/module (?) should no longer perform
>>> any further function except for acting as a master of this data and then
>>> supporting operations to query the data (via a key).
>>>
>>> Any pointers on how this can be done? I was looking for an operator or
>>> any other entity which can load this data at startup (Activation or Setup)
>>> and then allow queries to be submitted to it via an input port.
>>>
>>> -R
>>
>>
>> --
>>
>> Regards,
>> Ashwin.
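
For reference, the load-in-setup design from Ashwin's reply (scenarios 1 and 2) can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions, not an Apex operator and not the Malhar API: the class InMemoryLookup, its methods, and the "key,value" line format are all hypothetical, and the Reader stands in for a file opened from HDFS. Each partition reloads its share of the data whenever setup runs (so no checkpoint is needed), and queries are routed by the same hash of the lookup key.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a partitioned in-memory lookup; not Apex/Malhar code. */
class InMemoryLookup {
    private final Map<String, String> table = new HashMap<>();

    /** Maps a key to a partition index; the same key always gets the same index. */
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    /**
     * Stands in for the operator's setup call: discards any old contents and
     * reloads the "key,value" lines that belong to this partition.
     */
    void setup(Reader source, int partitionIndex, int partitionCount) {
        table.clear();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                int comma = line.indexOf(',');
                if (comma < 0) {
                    continue; // skip malformed lines (assumed format: key,value)
                }
                String key = line.substring(0, comma);
                if (partitionFor(key, partitionCount) == partitionIndex) {
                    table.put(key, line.substring(comma + 1));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stands in for handling one query tuple; returns null if the key is absent. */
    String lookup(String key) {
        return table.get(key);
    }

    public static void main(String[] args) {
        String data = "a,1\nb,2\n";
        int partitions = 2;
        InMemoryLookup[] ops = new InMemoryLookup[partitions];
        for (int i = 0; i < partitions; i++) {
            ops[i] = new InMemoryLookup();
            ops[i].setup(new StringReader(data), i, partitions);
        }
        // Route the query to the partition that owns the key, then look it up.
        String key = "a";
        System.out.println(ops[partitionFor(key, partitions)].lookup(key));
    }
}
```

With a single partition (partitionCount = 1) this reduces to scenario 1, where the whole file is held by one instance. In a real application the routing done in main would be handled by the platform's stream partitioning on the lookup key rather than by hand.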