Hi, I write machine learning code in Java on top of Hadoop. This involves (de-)serializing the learned models to and from files on HDFS or, more generally, byte streams.
The model is usually represented at some stage as a huge double[] (think gigabytes) plus some metadata in the form of a Map<String, String> (tiny, fewer than 100 entries). When serializing, I'd like to satisfy the following desiderata:

(1) Never, ever copy the double[] to (de-)serialize it, and never box the doubles into Double instances. The model size is usually chosen based on available memory, so there is no wiggle room.

(2) Serialize using a defined schema and make sure that the recipient can get the schema.

Requirement (2) is satisfied by using the specific API and Avro's data files (do they work on HDFS?). However, using that API entails copying the data from the double[] into Avro's data structures and vice versa. Requirement (1) can be satisfied by using the BinaryEncoder/BinaryDecoder API, as Doug described to me on this mailing list last October.

Now the question: is there a standard way of achieving both? If I can, I'd like to avoid writing special-cased code for this. For reference, a rough sketch of the no-copy path I have in mind follows.
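This is only a sketch, assuming the EncoderFactory/DecoderFactory API from Avro 1.5+; the record name "Model" and the field names "weights"/"meta" are made up for the example:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class ModelCodec {

  // Hand-written schema describing what the bytes below contain.
  // The raw binary encoder does not embed it; it would have to travel separately.
  static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Model\",\"fields\":["
      + "{\"name\":\"weights\",\"type\":{\"type\":\"array\",\"items\":\"double\"}},"
      + "{\"name\":\"meta\",\"type\":{\"type\":\"map\",\"values\":\"string\"}}]}");

  // Writes straight from the double[]: no intermediate copy, no boxing.
  static void write(double[] weights, Map<String, String> meta, OutputStream out)
      throws IOException {
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    enc.writeArrayStart();
    enc.setItemCount(weights.length);
    for (double w : weights) {
      enc.startItem();
      enc.writeDouble(w);
    }
    enc.writeArrayEnd();
    enc.writeMapStart();
    enc.setItemCount(meta.size());
    for (Map.Entry<String, String> e : meta.entrySet()) {
      enc.startItem();
      enc.writeString(e.getKey());
      enc.writeString(e.getValue());
    }
    enc.writeMapEnd();
    enc.flush();
  }

  // Reads back into a preallocated double[] of the expected size.
  static Map<String, String> read(double[] weights, InputStream in)
      throws IOException {
    BinaryDecoder dec = DecoderFactory.get().binaryDecoder(in, null);
    int i = 0;
    for (long n = dec.readArrayStart(); n > 0; n = dec.arrayNext()) {
      for (long j = 0; j < n; j++) {
        weights[i++] = dec.readDouble();
      }
    }
    Map<String, String> meta = new HashMap<String, String>();
    for (long n = dec.readMapStart(); n > 0; n = dec.mapNext()) {
      for (long j = 0; j < n; j++) {
        String key = dec.readString(null).toString();
        String value = dec.readString(null).toString();
        meta.put(key, value);
      }
    }
    return meta;
  }
}

As far as I can tell, the bytes written this way conform to the schema above, but the schema itself is never shipped with them; that is the part I'd rather not solve with hand-rolled, special-cased code.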
Thanks, Markus