Hi,

I write machine learning code in Java on top of Hadoop. This involves
(de-)serializing the learned models to and from files on HDFS or, more
generally, byte streams.

The model is usually represented at some stage as a huge double[] (think
gigabytes) plus some additional metadata in the form of a Map<String, String>
(tiny, fewer than 100 entries).
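
For concreteness, I imagine the schema would be something like the following
(hand-written sketch, the record and field names are just placeholders; I'm
assuming the Schema.Parser API of recent Avro releases, older ones have the
static Schema.parse(String)):

import org.apache.avro.Schema;

public class ModelSchema {
    // A record holding the big weight array and the small metadata map.
    public static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"Model\", \"fields\": ["
      + " {\"name\": \"weights\", \"type\": {\"type\": \"array\", \"items\": \"double\"}},"
      + " {\"name\": \"metadata\", \"type\": {\"type\": \"map\", \"values\": \"string\"}}"
      + "]}");
}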

When serializing, I'd like to satisfy the following desiderata:

(1) Never, ever copy the double[] to (de-)serialize it, and never box the
doubles into Double instances. The model size is usually chosen based on the
available memory, so there is no wiggle room...

(2) Serialize using a defined schema and make sure that the recipient can
get the schema.

Requirement (2) is satisfied by using the specific API and Avro's data files
(do they work on HDFS?). However, using that API entails copying the data
from the double[] into Avro's data structures and vice versa. Requirement (1)
can be satisfied by using the Binary[De|En]coder API, as Doug described to me
on this mailing list last October.
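
For reference, the copy-free write of the double[] would look roughly like
this (sketch only; I'm assuming the EncoderFactory API of recent Avro
releases, older ones construct the BinaryEncoder directly):

import java.io.IOException;
import java.io.OutputStream;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class RawWeightsWriter {
    // Streams the weights one primitive at a time: no copy of the array,
    // no boxing into Double.
    public static void write(double[] weights, OutputStream out)
            throws IOException {
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        enc.writeArrayStart();
        enc.setItemCount(weights.length);
        for (double w : weights) {
            enc.startItem();
            enc.writeDouble(w);
        }
        enc.writeArrayEnd();
        enc.flush(); // the encoder is buffered
    }
}

That covers (1), but nothing in the output tells the recipient what the bytes
mean, so (2) is not addressed.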

Now the question: is there a standard way of achieving both? If possible, I'd
like to avoid writing special-cased code for this...
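
Just to make clear what I mean by special-cased code: something like a
hand-written DatumWriter plugged into a DataFileWriter, so the file header
still carries the schema while the doubles go out as primitives. All names
below are made up and this is only a sketch:

import java.io.IOException;
import java.io.OutputStream;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.Encoder;

class Model {
    double[] weights;
    Map<String, String> metadata;
}

class ModelDatumWriter implements DatumWriter<Model> {
    public void setSchema(Schema s) { /* schema is fixed, nothing to do */ }

    public void write(Model m, Encoder out) throws IOException {
        // Field 1: the weights, written as primitives (no copy, no boxing).
        out.writeArrayStart();
        out.setItemCount(m.weights.length);
        for (double w : m.weights) {
            out.startItem();
            out.writeDouble(w);
        }
        out.writeArrayEnd();

        // Field 2: the tiny metadata map.
        out.writeMapStart();
        out.setItemCount(m.metadata.size());
        for (Map.Entry<String, String> e : m.metadata.entrySet()) {
            out.startItem();
            out.writeString(e.getKey());
            out.writeString(e.getValue());
        }
        out.writeMapEnd();
    }
}

class ModelFileWriter {
    static void save(Model m, Schema schema, OutputStream out)
            throws IOException {
        DataFileWriter<Model> writer =
            new DataFileWriter<Model>(new ModelDatumWriter());
        writer.create(schema, out); // schema goes into the file header
        writer.append(m);           // delegates to ModelDatumWriter.write()
        writer.close();
    }
}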

Thanks,

Markus
