Roger,

The operator's design depends on your requirements, such as expected latency
and the size of the data.

If latency needs to be as low as possible, meaning completely in-memory and
not hitting the disk for read I/O, there are two scenarios:
1. If the lookup data is small --> just load it into memory in the setup
call, and switch off checkpointing to get rid of the checkpoint I/O latency in
between. If the operator restarts, the data should be reloaded in setup.
2. If the lookup data is large --> use many partitions of this operator to
minimize the footprint of each partition. Still switch off checkpointing
and reload in setup if the operator restarts. Having many partitions
will ensure that the setup load is fast. The incoming queries need to be
partitioned based on the lookup key.
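
A rough sketch of the two scenarios above, in plain Java with no Apex
dependency. The class InMemoryLookup, its setup/lookup methods, the
"key,value" line format and the hash-based partitionFor function are all
assumptions made for illustration; they are not the Apex Operator API or
the POJOEnricher implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one partition of a lookup operator. */
class InMemoryLookup {
    // Never checkpointed in this sketch: the table is rebuilt in setup().
    private transient Map<String, String> table;
    private final int partitionId;
    private final int partitionCount;

    InMemoryLookup(int partitionId, int partitionCount) {
        this.partitionId = partitionId;
        this.partitionCount = partitionCount;
    }

    /** Maps a key to its owning partition; queries must be routed the same way. */
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    /** Runs at start and after every restart; loads only this partition's keys. */
    void setup(List<String> lines) {
        table = new HashMap<>();
        for (String line : lines) {
            int comma = line.indexOf(',');
            if (comma < 0) {
                continue; // skip malformed lines (assumed "key,value" format)
            }
            String key = line.substring(0, comma);
            if (partitionFor(key, partitionCount) == partitionId) {
                table.put(key, line.substring(comma + 1));
            }
        }
    }

    /** Pure in-memory read; returns null when the key is not in this partition. */
    String lookup(String key) {
        return table.get(key);
    }

    int size() {
        return table.size();
    }
}
```

With a single partition (partitionCount = 1) this is scenario 1. With more
partitions each instance holds only its share of the data, so a restart
reloads less, provided the queries are routed with the same partitionFor
function.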

You can use the POJOEnricher with FSLoader for the above design.

Code:
https://github.com/apache/apex-malhar/blob/master/contrib/src/main/java/com/datatorrent/contrib/enrich/POJOEnricher.java
Example:
https://github.com/DataTorrent/examples/tree/master/tutorials/enricher

If the lookup dataset is large and the latency caused by disk read I/O is
acceptable, then use HDHT or managed state as a backup mechanism for the
in-memory data to decrease the checkpoint footprint. I could not find an
example for managed state, but here are the links for HDHT:

Code:
https://github.com/DataTorrent/Megh/tree/master/contrib/src/main/java/com/datatorrent/contrib/hdht
Example:
https://github.com/DataTorrent/examples/blob/master/tutorials/hdht/src/test/java/com/example/HDHTAppTest.java
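
To illustrate the general idea only (this is not the HDHT or managed state
API), here is a minimal sketch of a bounded in-memory cache that falls back
to a slower store on a miss. The class ReadThroughCache and the Function
used as the backing store are hypothetical; in a real application the
backing store would be HDHT or managed state.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical read-through cache: small hot set in memory, the rest on disk. */
class ReadThroughCache {
    private final Function<String, String> backingStore; // stand-in for the disk-backed store
    private final Map<String, String> hot;
    private int storeReads = 0;

    ReadThroughCache(final int capacity, Function<String, String> backingStore) {
        this.backingStore = backingStore;
        // Access-ordered LinkedHashMap evicts the least recently used entry.
        this.hot = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Serves from memory when possible; otherwise pays one read on the backing store. */
    String get(String key) {
        String value = hot.get(key);
        if (value == null) {
            value = backingStore.apply(key);
            storeReads++;
            if (value != null) {
                hot.put(key, value);
            }
        }
        return value;
    }

    int storeReads() {
        return storeReads;
    }

    int cachedEntries() {
        return hot.size();
    }
}
```

Only the bounded hot set lives in operator memory here, which is the reason
the footprint stays small; the price is a disk read on every cache miss.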

Regards,
Ashwin.

On Sun, Jan 22, 2017 at 10:45 PM, Sanjay Pujare <san...@datatorrent.com>
wrote:

> You may want to take a look at com.datatorrent.lib.fileaccess.DTFileReader
> in the malhar-library; I am not sure whether it supports reading the whole
> file into memory.
>
>
>
> Also there is a library called Megh at https://github.com/DataTorrent/Megh
> where you might find some useful operators like
> com.datatorrent.contrib.hdht.hfile.HFileImpl .
>
>
>
> *From: *Roger F <rf301...@gmail.com>
> *Reply-To: *<users@apex.apache.org>
> *Date: *Sunday, January 22, 2017 at 9:32 PM
> *To: *<users@apex.apache.org>
> *Subject: *One-time Initialization of in-memory data using a data file
>
>
>
> Hi,
>
> I have a use case where application business data needs to be migrated from a
> legacy system (such as a mainframe) into HDFS and then loaded for use by an
> Apex application.
>
> To get this done, an approach being considered is to perform a one-time
> initialization of the data from HDFS into application memory. This data
> will then be queried for various business logic functions of the
> application.
>
> Once the data is loaded, this operator/module (?) should no longer perform
> any further function except for acting as a master of this data and then
> supporting operations to query the data (via a key).
>
> Any pointers on how this can be done? I was looking for an operator or
> any other entity which can load this data at startup (Activation or Setup)
> and then allow queries to be submitted to it via an input port.
>
>
>
> -R
>



