I am trying to build a low-latency machine learning system from scratch using
Apache Ignite.  Note: I am still in the design phase and have not implemented
anything yet.

The general data pipeline is:
JSON data via socket -> Ignite cache -> Ignite ML (updating) -> Ignite cache
-> app (via continuous query), or maybe Ignite ML -> app via socket for
improved latency.
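
For the last hop (Ignite cache -> app), this is roughly the continuous-query
listener I am picturing. A minimal sketch, assuming a hypothetical
"predictions" cache keyed by entity id with double-valued model outputs:

    import javax.cache.event.CacheEntryEvent;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.query.ContinuousQuery;

    public class PredictionListener {
        public static void main(String[] args) {
            Ignite ignite = Ignition.start();

            // Hypothetical cache of model outputs, keyed by entity id.
            IgniteCache<Integer, Double> predictions =
                ignite.getOrCreateCache("predictions");

            ContinuousQuery<Integer, Double> qry = new ContinuousQuery<>();

            // Runs on this node whenever a prediction is written or updated.
            qry.setLocalListener(events -> {
                for (CacheEntryEvent<? extends Integer, ? extends Double> e : events)
                    System.out.println("key=" + e.getKey() + " prediction=" + e.getValue());
            });

            predictions.query(qry); // listener stays active until the cursor is closed
        }
    }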

I am trying to minimise latency and also improve ML speed.
Obviously the distributed, in-memory, colocated processing is quite useful for
high-performance ML when dealing with lots of data.

However, I am wondering:
1) What is best practice for performing the various operations so as to
improve latency / ML performance?
2) Could there be fundamental changes in the Ignite framework to better
support this kind of workload?

One important factor here could be serialisation / deserialisation speed.
This covers both JSON -> (some object) -> cache, and cache -> (some object) ->
ML vector.
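
Concretely, the naive version of those two hops would be something like the
following (Jackson for the JSON step; "Tick" and its fields are made-up names):

    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.ignite.IgniteCache;

    public class NaivePath {
        // Made-up record type; one deserialisation target per hop.
        public static class Tick {
            public double price;
            public double volume;
        }

        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Hop 1: JSON -> POJO -> cache. Ignite then marshals the POJO
        // into its binary format again internally on put.
        static void ingest(IgniteCache<Integer, Tick> cache, int key, String json)
            throws Exception {
            Tick tick = MAPPER.readValue(json, Tick.class);
            cache.put(key, tick);
        }

        // Hop 2: cache -> POJO -> feature array. get() deserialises the
        // stored binary form back into a POJO first.
        static double[] features(IgniteCache<Integer, Tick> cache, int key) {
            Tick tick = cache.get(key);
            return new double[] { tick.price, tick.volume };
        }
    }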

So the optimum would be to serialise from JSON directly into the cache
representation (binary serialisation?), and go straight from the cache into an
ML vector. Is this possible? Any best practice? Did this make it easier:
https://issues.apache.org/jira/browse/IGNITE-13672
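
From what I can tell, BinaryObjectBuilder plus withKeepBinary() might give
exactly that: build the cache's binary representation straight from the parsed
JSON fields (no POJO class at all), and on the read side pull individual
fields out of the BinaryObject without deserialising the whole value. A sketch
of what I have in mind (the "ticks" cache and field names are again made up):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.binary.BinaryObject;
    import org.apache.ignite.binary.BinaryObjectBuilder;

    public class BinaryIngest {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        // JSON -> BinaryObject directly; no intermediate POJO class exists.
        static void ingest(Ignite ignite, int key, String json) throws Exception {
            JsonNode node = MAPPER.readTree(json);

            BinaryObjectBuilder builder = ignite.binary().builder("Tick");
            builder.setField("price", node.get("price").asDouble());
            builder.setField("volume", node.get("volume").asDouble());

            ignite.<Integer, BinaryObject>cache("ticks").withKeepBinary()
                .put(key, builder.build());
        }

        // Read side: withKeepBinary() returns the stored binary form as-is,
        // and field(...) reads single fields without full deserialisation.
        static double[] features(Ignite ignite, int key) {
            IgniteCache<Integer, BinaryObject> cache =
                ignite.<Integer, BinaryObject>cache("ticks").withKeepBinary();

            BinaryObject obj = cache.get(key);
            return new double[] { obj.field("price"), obj.field("volume") };
        }
    }

Is something along those lines the recommended approach?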

It might also be worth looking at a more optimal way of representing the data
to improve ML performance. I have heard that a columnar layout is quite
useful: https://arrow.apache.org/overview/
Is it possible that something like this could be implemented as an
alternative cache memory architecture? If not, is there an alternative to the
on-heap Java array [] / Vector that the ML algorithms seem to use? And is it
possible for the ML algorithms to work on the data in place (in the cache),
without having to retrieve it first (is this what IgniteRDD does for Spark)?
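
On the "in place" question: as I understand the ML docs, the trainers already
work against the cache through a partition-based dataset plus a vectorizer, so
feature extraction runs on the nodes that own each partition rather than
pulling all rows back to the client. Something like this (KMeans purely as an
example; the cache layout with the label in the first array slot is an
assumption on my part):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.ml.clustering.kmeans.KMeansModel;
    import org.apache.ignite.ml.clustering.kmeans.KMeansTrainer;
    import org.apache.ignite.ml.dataset.feature.extractor.Vectorizer;
    import org.apache.ignite.ml.dataset.feature.extractor.impl.DoubleArrayVectorizer;
    import org.apache.ignite.ml.math.primitives.vector.VectorUtils;

    public class InPlaceTraining {
        static KMeansModel train(Ignite ignite, IgniteCache<Integer, double[]> data) {
            KMeansTrainer trainer = new KMeansTrainer();

            // The vectorizer turns cache rows into feature vectors on the
            // nodes owning each partition; label assumed in slot 0 here.
            return trainer.fit(
                ignite,
                data,
                new DoubleArrayVectorizer<Integer>()
                    .labeled(Vectorizer.LabelCoordinate.FIRST)
            );
        }

        static double predict(KMeansModel model, double[] features) {
            return model.predict(VectorUtils.of(features));
        }
    }

As far as I can tell, though, the dataset still materialises on-heap double[]
copies per partition, which is part of what prompted the Arrow question above.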

The two considerations, then, are improving ML algorithm performance and
minimising (de)serialisation overhead.

Thanks!


