Re: One-time Initialization of in-memory data using a data file

Thomas Weise Mon, 23 Jan 2017 15:44:55 -0800

First link without frame:

https://ci.apache.org/projects/apex-malhar/apex-malhar-javadoc-release-3.6/org/apache/apex/malhar/lib/state/managed/AbstractManagedStateImpl.html



On Mon, Jan 23, 2017 at 3:33 PM, Thomas Weise <t...@apache.org> wrote:
> Roger,
>
> An Apex operator typically holds state that it uses for processing, and
> that state is often mutable. "Managed state" in Malhar (and its
> predecessor, HDHT) was designed for large state that can be mutated
> efficiently under a specific write pattern (semi-ordered keys). However,
> there is no benefit in using it for immutable data that is already in
> HDFS.
>
> In that case it would be best to store the data (during migration/ingest)
> in HDFS in a file format that allows fast random reads (block-structured
> files like HFile or TFile, or any other indexed structure, provide
> that).
>
> Also, depending on how the data, once in memory, would be used, an
> Apex operator may or may not be the right home. If the goal is only to
> look up data, without further processing, using a synchronous
> request/response pattern, then an IMDG or similar system may be a more
> appropriate solution.
>
> Here are pointers for managed state:
>
> https://ci.apache.org/projects/apex-malhar/apex-malhar-javadoc-release-3.6/index.html
> https://github.com/apache/apex-malhar/blob/master/benchmark/src/main/java/com/datatorrent/benchmark/state/ManagedStateBenchmarkApp.java
>
> Thanks,
> Thomas
>
>
> On Sun, Jan 22, 2017 at 11:43 PM, Ashwin Chandra Putta
> <ashwinchand...@gmail.com> wrote:
>> Roger,
>>
>> Depending on the requirements for expected latency, size of data, etc.,
>> the operator's design will change.
>>
>> If latency needs to be the lowest possible, meaning completely in-memory and
>> not hitting the disk for read I/O, there are two scenarios:
>> 1. If the lookup data size is small --> just load it into memory in the setup
>> call, and switch off checkpointing to get rid of checkpoint I/O latency in
>> between. In case of operator restarts, the data should be reloaded in setup.
>> 2. If the lookup data is large --> have many partitions of this operator to
>> minimize the footprint of each partition. Still switch off checkpointing and
>> reload in setup in case of operator restart. Having many partitions will
>> ensure that the setup load is fast. The incoming queries need to be
>> partitioned based on the lookup key.
>>
>> You can use the POJOEnricher with FSLoader for the above design.
>>
>> Code:
>> https://github.com/apache/apex-malhar/blob/master/contrib/src/main/java/com/datatorrent/contrib/enrich/POJOEnricher.java
>> Example:
>> https://github.com/DataTorrent/examples/tree/master/tutorials/enricher
>>
>> If the lookup dataset is large and the latency caused by disk read I/O is
>> acceptable, then use HDHT or managed state as a backup mechanism for the
>> in-memory data to decrease the checkpoint footprint. I could not find an
>> example for managed state, but here are the links for HDHT:
>>
>> Code:
>> https://github.com/DataTorrent/Megh/tree/master/contrib/src/main/java/com/datatorrent/contrib/hdht
>> Example:
>> https://github.com/DataTorrent/examples/blob/master/tutorials/hdht/src/test/java/com/example/HDHTAppTest.java
>>
>> Regards,
>> Ashwin.
>>
>> On Sun, Jan 22, 2017 at 10:45 PM, Sanjay Pujare <san...@datatorrent.com>
>> wrote:
>>>
>>> You may want to take a look at com.datatorrent.lib.fileaccess.DTFileReader
>>> in the malhar-library – not sure whether it supports reading the whole file
>>> into memory.
>>>
>>>
>>>
>>> Also there is a library called Megh at https://github.com/DataTorrent/Megh
>>> where you might find some useful operators like
>>> com.datatorrent.contrib.hdht.hfile.HFileImpl.
>>>
>>>
>>>
>>> From: Roger F <rf301...@gmail.com>
>>> Reply-To: <users@apex.apache.org>
>>> Date: Sunday, January 22, 2017 at 9:32 PM
>>> To: <users@apex.apache.org>
>>> Subject: One-time Initialization of in-memory data using a data file
>>>
>>>
>>>
>>> Hi,
>>>
>>> I have a use case where application business data needs to be migrated
>>> from a legacy system (such as a mainframe) into HDFS and then loaded for
>>> use by an Apex application.
>>>
>>> To get this done, the approach being considered is to perform a one-time
>>> initialization of the data from HDFS into application memory. This data
>>> will then be queried for various business logic functions of the
>>> application.
>>>
>>> Once the data is loaded, this operator/module (?) should no longer perform
>>> any further function except for acting as a master of this data and then
>>> supporting operations to query the data (via a key).
>>>
>>> Any pointers on how this can be done? I was looking for an operator or
>>> any other entity which can load this data at startup (Activation or Setup)
>>> and then allow queries to be submitted to it via an input port.
>>>
>>>
>>>
>>> -R
>>
>>
>>
>>
>> --
>>
>> Regards,
>> Ashwin.
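
For reference, the "load in setup, partition by lookup key" design that Ashwin describes above could be sketched roughly as follows. This is plain Java, not Apex API: LookupPartition, partitionFor and the "key,value" line format are all hypothetical names and assumptions made for illustration, and the Reader stands in for whatever would actually open the file in HDFS. In a real operator the load would sit in the operator's setup call, and the table would be kept out of checkpointed state so it is simply reloaded after a restart.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for one partition of a lookup operator (not an Apex class). */
class LookupPartition {
    private final int partitionId;
    private final int partitionCount;

    // transient: the table is rebuilt in setup(), so it need not be checkpointed.
    private transient Map<String, String> table = new HashMap<>();

    LookupPartition(int partitionId, int partitionCount) {
        this.partitionId = partitionId;
        this.partitionCount = partitionCount;
    }

    /** Maps a lookup key to a partition; queries must be routed with the same function. */
    static int partitionFor(String key, int partitionCount) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    /** One-time load: keeps only the records whose key belongs to this partition. */
    void setup(Reader source) throws IOException {
        table = new HashMap<>();
        BufferedReader in = new BufferedReader(source);
        String line;
        while ((line = in.readLine()) != null) {
            int comma = line.indexOf(',');
            if (comma < 0) {
                continue; // skip lines that are not in the assumed "key,value" form
            }
            String key = line.substring(0, comma);
            if (partitionFor(key, partitionCount) == partitionId) {
                table.put(key, line.substring(comma + 1));
            }
        }
    }

    /** Answers a query by key; returns null if this partition does not hold the key. */
    String lookup(String key) {
        return table.get(key);
    }

    int size() {
        return table.size();
    }
}
```

Each partition scans the whole file here but keeps only its own share of the keys, so the in-memory footprint per partition shrinks as partitions are added. A restart just means calling setup again on the same source.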
