Hi,
Over at Mahout (http://lucene.apache.org/mahout) we have a Vector
interface with two implementations, DenseVector and SparseVector. When
it comes to writing Mapper/Reducer, we have been able to just use
Vector, but when it comes to actually binding real data via a
Configuration, we need to specify, I think, the actual implementation
being used, as in something like:
    conf.setOutputValueClass(SparseVector.class);
Ideally, we'd like to put off picking a particular implementation
until as late as possible. Right now, we've pushed this off to the user
to pass in the implementation, but even that is less than ideal for a
variety of reasons. While we typically wouldn't expect the data to be
a mixture of Dense and Sparse, there really shouldn't be a reason why
it can't be. We realize we could write out the class name to the
DataOutput (we implement Writable), but that forces us to either hack
in some String compares or use Class.forName(), which seems like it
wouldn't perform well (although I admit I haven't tested that yet;
presumably the JDK can cache the info).
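To make the idea concrete, here is a rough sketch of one variation:
write a one-byte type tag instead of the class name, so reading needs
neither String compares nor Class.forName(). This is not Mahout code.
Vec, DenseVec, SparseVec and VecWritable are hypothetical stand-ins,
and Vec only mirrors the write/readFields pair of Hadoop's Writable so
that the example compiles without Hadoop on the classpath.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical stand-in for the Vector interface; it mirrors only the
// write/readFields pair of Hadoop's Writable.
interface Vec {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical dense implementation: length followed by every value.
final class DenseVec implements Vec {
    double[] values = new double[0];

    public void write(DataOutput out) throws IOException {
        out.writeInt(values.length);
        for (double v : values) out.writeDouble(v);
    }

    public void readFields(DataInput in) throws IOException {
        values = new double[in.readInt()];
        for (int i = 0; i < values.length; i++) values[i] = in.readDouble();
    }
}

// Hypothetical sparse implementation: cardinality, then (index, value) pairs.
final class SparseVec implements Vec {
    int cardinality;
    int[] indices = new int[0];
    double[] values = new double[0];

    public void write(DataOutput out) throws IOException {
        out.writeInt(cardinality);
        out.writeInt(indices.length);
        for (int i = 0; i < indices.length; i++) {
            out.writeInt(indices[i]);
            out.writeDouble(values[i]);
        }
    }

    public void readFields(DataInput in) throws IOException {
        cardinality = in.readInt();
        int n = in.readInt();
        indices = new int[n];
        values = new double[n];
        for (int i = 0; i < n; i++) {
            indices[i] = in.readInt();
            values[i] = in.readDouble();
        }
    }
}

// Wrapper that prefixes each record with a one-byte tag, so dense and
// sparse vectors can be mixed in the same stream.
final class VecWritable {
    static final byte DENSE = 0;
    static final byte SPARSE = 1;

    private Vec vector;

    VecWritable() {}
    VecWritable(Vec vector) { this.vector = vector; }

    Vec get() { return vector; }

    public void write(DataOutput out) throws IOException {
        byte tag;
        if (vector instanceof DenseVec) tag = DENSE;
        else if (vector instanceof SparseVec) tag = SPARSE;
        else throw new IOException("Unsupported vector: " + vector);
        out.writeByte(tag);
        vector.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        byte tag = in.readByte();
        switch (tag) {
            case DENSE:  vector = new DenseVec();  break;
            case SPARSE: vector = new SparseVec(); break;
            default: throw new IOException("Unknown vector tag: " + tag);
        }
        vector.readFields(in);
    }
}
```

If something like VecWritable implemented Hadoop's Writable, the job
would presumably only need to name the wrapper in
conf.setOutputValueClass(...), and the concrete implementation would
be chosen per record at read time. The obvious cost is that every new
Vector implementation needs its own tag in the wrapper.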
Thanks,
Grant