In principle, it would be really nice if we could parametrize both our desire
for larger entity sets / vocabularies (keys of type 'long' vs. 'int') and
our precision on values ('float' vs. 'double' vs. even 'boolean'). But while
we've talked about this, adding a proliferation of FloatVector, DoubleVector
(and BooleanVector), together with LongMatrix vs. IntMatrix, would really
complicate all of the higher-level APIs. It's possible, but could be ugly.

Alternatively, we could standardize everything to "one size fits all" and
break backwards compatibility with either all Taste users, or all of our
algorithms and data in the classification / clustering / vectorization
codebase.

Or we could write some simple utilities (some we already have) to convert
formats internally when needed, warning when collisions are possible on
key-range folding, and accepting that we may lose precision or bloat the
data size. The latter approach is probably best, IMO.
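To make that concrete, here is a minimal sketch of what such a conversion
utility could look like. None of these names exist in Mahout --
PreferenceFolding, foldKey, and toVectorEntries are hypothetical, and the
XOR fold is just one possible narrowing scheme:

import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only, not actual Mahout API: folds Taste-style
 * (long, float) preference entries into the (int, double) shape that
 * the Vector side of the codebase expects.
 */
public final class PreferenceFolding {

  private PreferenceFolding() {}

  /** Fold a 64-bit key into 32 bits (same scheme as Long.hashCode). */
  static int foldKey(long key) {
    return (int) (key ^ (key >>> 32));
  }

  /**
   * Converts a <long, float> preference map to <int, double> entries.
   * Widening float -> double is exact; narrowing long -> int is not,
   * so we warn when two distinct long keys fold to the same int.
   */
  static Map<Integer, Double> toVectorEntries(Map<Long, Float> prefs) {
    Map<Integer, Double> out = new HashMap<Integer, Double>(prefs.size());
    Map<Integer, Long> seen = new HashMap<Integer, Long>(prefs.size());
    for (Map.Entry<Long, Float> e : prefs.entrySet()) {
      int folded = foldKey(e.getKey());
      Long previous = seen.put(folded, e.getKey());
      if (previous != null && previous.longValue() != e.getKey()) {
        System.err.printf("Warning: keys %d and %d both fold to %d%n",
            previous, e.getKey(), folded);
      }
      out.put(folded, e.getValue().doubleValue());  // lossless widening
    }
    return out;
  }
}

Going the other direction (double -> float) is where precision is actually
dropped, so a similar warning would belong there too.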
On Tue, Aug 16, 2011 at 11:04 AM, Jeff Hansen <[email protected]> wrote:
> When I first started reading the Manning book, I was a little surprised by
> the description of data structures for preferences in the collaborative
> filtering section. Before getting the book I had really only played around
> with the Vector implementations, and I was used to the Vectors being
> generic lists of <int, double> pairs. So I was a little surprised to read
> the description of all the collaborative filtering implementations using
> generic lists of <long, float> pairs.
>
> I was wondering if I could get some general comments on the reason for
> this disparity. I'm guessing it's a matter of history and optimization --
> Taste was optimized for storing more info at the index level and less at
> the "rating" level, whereas Vectors were intended to be generic, with the
> ability to maintain the maximum amount of precision. Unfortunately the
> lowest common denominator is int/float, so if you want to go between
> models you have to fit within the smaller footprint constraint of each
> without getting the benefit of either...
>
> It ends up feeling like there are two faces to Mahout which are somewhat
> incompatible. Are there any thoughts about bridging the gap between the
> two models in the future? If this really is a matter of each model being
> optimized for its problem space, maybe it would just help to have a clear
> delineation of which utilities belong on which side of the fence -- as
> well as some utility for shifting generic types between the models (with
> the warning that there might be loss of precision or of the ability to
> maintain as many ids). That way utilities that already exist on the one
> side could be reused on the other side.
