Re: One-time Initialization of in-memory data using a data file

Amol Kekre Mon, 23 Jan 2017 16:09:33 -0800

Roger,
I am guessing here, but the ask seems to be to stop an Apex app from
querying the Mainframe for data it would otherwise fetch there. The aim is
to offload key-val lookups from the Mainframe. Instead, you want the query
to be served by an operator that does what a key-val store would do for you
(via an input port).
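
A minimal sketch of that idea, in plain Java with no Apex dependencies: load
the data once, then answer queries by key. The class and method names
(InMemoryLookup, load, lookup) and the "key,value" line format are
assumptions for illustration only. In an Apex operator the load would
typically run in setup and the lookup would be driven by tuples arriving on
an input port.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: not an Apex operator, just the load-once / query-by-key shape.
class InMemoryLookup {
  private final Map<String, String> data = new HashMap<>();

  // One-time initialization: read "key,value" lines into memory.
  void load(Reader source) {
    try (BufferedReader reader = new BufferedReader(source)) {
      String line;
      while ((line = reader.readLine()) != null) {
        int comma = line.indexOf(',');
        if (comma < 0) {
          continue; // skip lines without a key/value separator
        }
        data.put(line.substring(0, comma), line.substring(comma + 1));
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Query by key; returns null when the key is unknown.
  String lookup(String key) {
    return data.get(key);
  }

  public static void main(String[] args) {
    InMemoryLookup store = new InMemoryLookup();
    store.load(new StringReader("1001,ACME Corp\n1002,Globex\n"));
    System.out.println(store.lookup("1002"));
  }
}
```

Because the map is rebuilt from the source on every load, nothing here needs
to be checkpointed. After a restart the same load can simply run again.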


In such a case HDHT/ManagedState would work. It under-utilizes what they
do, but it will work. You can also put in an IMDG that the Apex app queries
directly. In essence you would replace the Mainframe with an IMDG.
Functionally the setup will work just like the Mainframe. Operationally,
however, do check whether you want to put up another system (the IMDG in
this case) for your devOps to support. From a TCO perspective you may get
different answers from your devOps.

Thanks,
Amol


On Mon, Jan 23, 2017 at 3:44 PM, Thomas Weise <[email protected]> wrote:

> First link without frame:
>
> https://ci.apache.org/projects/apex-malhar/apex-malhar-javadoc-release-3.6/org/apache/apex/malhar/lib/state/managed/AbstractManagedStateImpl.html
>
>
> On Mon, Jan 23, 2017 at 3:33 PM, Thomas Weise <[email protected]> wrote:
> > Roger,
> >
> > An Apex operator typically holds state that it uses for processing, and
> > often that state is mutable. "Managed state" in Malhar (and its
> > predecessor HDHT) was designed for large state that can be mutated
> > efficiently under a specific write pattern (semi-ordered keys). However,
> > there is no benefit in using these for immutable data that is already
> > in HDFS.
> >
> > In such a case it would be best to store the data (during
> > migration/ingest) in HDFS in a file format that allows for fast random
> > reads (block-structured files like HFile or TFile, or any other indexed
> > structure, provide that).
> >
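
To illustrate what "indexed structure" means here, the following is a toy
sketch in plain Java. It is not HFile or TFile and it works on a local file
rather than HDFS. It writes each value at a known offset and keeps a
key-to-offset index, so a read seeks straight to the record instead of
scanning the file. The class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an offset index over a local file.
class IndexedFileSketch {

  // Writes every value to the file and returns an index of key -> byte offset.
  static Map<String, Long> write(File file, Map<String, String> records) {
    Map<String, Long> index = new HashMap<>();
    try (RandomAccessFile out = new RandomAccessFile(file, "rw")) {
      out.setLength(0); // start from an empty file
      for (Map.Entry<String, String> e : records.entrySet()) {
        index.put(e.getKey(), out.getFilePointer());
        out.writeUTF(e.getValue());
      }
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
    return index;
  }

  // Random read: seek to the indexed offset and read only that record.
  static String read(File file, Map<String, Long> index, String key) {
    Long offset = index.get(key);
    if (offset == null) {
      return null; // key not present
    }
    try (RandomAccessFile in = new RandomAccessFile(file, "r")) {
      in.seek(offset);
      return in.readUTF();
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
  }
}
```

Real block-structured formats store the index inside the file and index
blocks rather than single records, but the access pattern is the same:
consult the index, then seek.
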
> > Also, depending on how the data, once in memory, would be used, an
> > Apex operator may or may not be the right home. If the goal is only to
> > look up data, without further processing, in a synchronous
> > request/response pattern, then an IMDG or similar system may be a more
> > appropriate solution.
> >
> > Here are pointers for managed state:
> >
> > https://ci.apache.org/projects/apex-malhar/apex-malhar-javadoc-release-3.6/index.html
> > https://github.com/apache/apex-malhar/blob/master/benchmark/src/main/java/com/datatorrent/benchmark/state/ManagedStateBenchmarkApp.java
> >
> > Thanks,
> > Thomas
> >
> >
> > On Sun, Jan 22, 2017 at 11:43 PM, Ashwin Chandra Putta
> > <[email protected]> wrote:
> >> Roger,
> >>
> >> Depending on the requirements for expected latency, size of data, etc.,
> >> the operator's design will change.
> >>
> >> If latency needs to be the lowest possible, meaning completely in-memory
> >> and not hitting the disk for read I/O, there are two scenarios:
> >> 1. If the lookup data size is small --> just load it into memory in the
> >> setup call, and switch off checkpointing to get rid of checkpoint I/O
> >> latency in between. In case of operator restarts, the data should be
> >> reloaded in setup.
> >> 2. If the lookup data is large --> have many partitions of this operator
> >> to minimize the footprint of each partition. Still switch off
> >> checkpointing and reload in setup in case of operator restart. Having
> >> many partitions will ensure that the setup load is fast. The incoming
> >> query needs to be partitioned based on the lookup key.
> >>
> >> You can use the POJOEnricher with FSLoader for the above design.
> >>
> >> Code:
> >> https://github.com/apache/apex-malhar/blob/master/contrib/src/main/java/com/datatorrent/contrib/enrich/POJOEnricher.java
> >> Example:
> >> https://github.com/DataTorrent/examples/tree/master/tutorials/enricher
> >>
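
The second scenario (many partitions, queries routed by lookup key) can be
sketched in plain Java as follows. This only illustrates the idea of hashing
the key to pick a partition and having each partition load just its own
share. It does not use Apex's own partitioning mechanism, and the names
KeyPartitionSketch, partitionFor and loadPartition are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: key-based partitioning of a lookup data set.
class KeyPartitionSketch {

  // Maps a lookup key to a partition id in the range [0, partitionCount).
  static int partitionFor(String key, int partitionCount) {
    return Math.floorMod(key.hashCode(), partitionCount);
  }

  // What one partition would keep in memory after its setup-time load:
  // only the entries whose key hashes to this partition.
  static Map<String, String> loadPartition(Map<String, String> source,
                                           int partitionId, int partitionCount) {
    Map<String, String> local = new HashMap<>();
    for (Map.Entry<String, String> e : source.entrySet()) {
      if (partitionFor(e.getKey(), partitionCount) == partitionId) {
        local.put(e.getKey(), e.getValue());
      }
    }
    return local;
  }

  public static void main(String[] args) {
    Map<String, String> source = new HashMap<>();
    source.put("1001", "ACME Corp");
    source.put("1002", "Globex");

    int partitions = 4;
    String key = "1002";
    // The query must go to the same partition that loaded the key.
    int target = partitionFor(key, partitions);
    Map<String, String> local = loadPartition(source, target, partitions);
    System.out.println("partition " + target + " -> " + local.get(key));
  }
}
```

The important property is that the load and the query routing use the same
function of the key. Otherwise a query can land on a partition that never
loaded the entry.
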
> >> If the lookup dataset is large and the latency caused by disk read I/O
> >> is acceptable, then use HDHT or managed state as a backup mechanism for
> >> the in-memory data to decrease the checkpoint footprint. I could not
> >> find an example for managed state, but here are the links for HDHT:
> >>
> >> Code:
> >> https://github.com/DataTorrent/Megh/tree/master/contrib/src/main/java/com/datatorrent/contrib/hdht
> >> Example:
> >> https://github.com/DataTorrent/examples/blob/master/tutorials/hdht/src/test/java/com/example/HDHTAppTest.java
> >>
> >> Regards,
> >> Ashwin.
> >>
> >> On Sun, Jan 22, 2017 at 10:45 PM, Sanjay Pujare <[email protected]> wrote:
> >>>
> >>> You may want to take a look at
> >>> com.datatorrent.lib.fileaccess.DTFileReader in the malhar-library –
> >>> not sure whether it supports reading the whole file into memory.
> >>>
> >>>
> >>>
> >>> Also there is a library called Megh at
> >>> https://github.com/DataTorrent/Megh where you might find some useful
> >>> operators like com.datatorrent.contrib.hdht.hfile.HFileImpl.
> >>>
> >>>
> >>>
> >>> From: Roger F <[email protected]>
> >>> Reply-To: <[email protected]>
> >>> Date: Sunday, January 22, 2017 at 9:32 PM
> >>> To: <[email protected]>
> >>> Subject: One-time Initialization of in-memory data using a data file
> >>>
> >>>
> >>>
> >>> Hi,
> >>>
> >>> I have a use case where application business data needs to be migrated
> >>> from a legacy system (such as a mainframe) into HDFS and then loaded
> >>> for use by an Apex application.
> >>>
> >>> To get this done, one approach being considered is to perform a
> >>> one-time initialization of the data from HDFS into application memory.
> >>> This data will then be queried for various business logic functions of
> >>> the application.
> >>>
> >>> Once the data is loaded, this operator/module (?) should no longer
> >>> perform any further function except for acting as a master of this
> >>> data and supporting operations to query the data (via a key).
> >>>
> >>> Any pointers on how this can be done? I was looking for an operator or
> >>> any other entity which can load this data at startup (Activation or
> >>> Setup) and then allow queries to be submitted to it via an input port.
> >>>
> >>>
> >>>
> >>> -R
> >>
> >>
> >>
> >>
> >> --
> >>
> >> Regards,
> >> Ashwin.
>
