This use case is doing Random Projection with "paired vectors". Look up
'semantic vectors' for an explanation.
My pipeline is three different M/R jobs in sequence with three different
semantics for the output vectors. The payload has to be included in all
three output sets. So I really do want a good vector I/O toolkit.
p.s. If you understand the math of why two flat-distribution random
numbers added together create a pyramidal distribution, please write. I'm
attempting to reverse this effect. [email protected]
Lance
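For what it's worth, the standard explanation of the p.s. is a convolution. The sketch below assumes the two numbers are independent and uniform on [0, 1] (the post does not state the range); X, Y, Z and U are names introduced here for illustration. The last line is one possible way to flatten the sum again, by passing it through its own CDF, and may or may not be what "reverse" means in the post.

```latex
% Assumption: X and Y are independent, each uniform on [0,1]; Z = X + Y.
% The density of a sum of independent variables is the convolution of
% their densities:
\[
  f_Z(z) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx .
\]
% The integrand is 1 exactly when 0 <= x <= 1 and 0 <= z - x <= 1, so
% f_Z(z) is the length of the overlap of [0,1] and [z-1, z]:
\[
  f_Z(z) =
  \begin{cases}
    z,     & 0 \le z \le 1, \\
    2 - z, & 1 < z \le 2, \\
    0,     & \text{otherwise,}
  \end{cases}
\]
% which is the triangular ("pyramidal") shape. Integrating gives the CDF:
\[
  F_Z(z) =
  \begin{cases}
    z^2 / 2,           & 0 \le z \le 1, \\
    1 - (2 - z)^2 / 2, & 1 < z \le 2 .
  \end{cases}
\]
% Probability integral transform: U = F_Z(Z) is uniform on [0,1] again.
\[
  U = F_Z(Z) \sim \mathrm{Uniform}(0, 1).
\]
```

Intuitively, there are many pairs (x, y) that add up to a middle value such as 1, but only one corner of the unit square that adds up to 0 or to 2.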
Ted Dunning wrote:
The use case for augmenting vector writable would be where you have a bunch
of vectors that you want to cluster and you want to keep around auxiliary
data associated with each vector rather than do a join down-stream from the
clustering.
I say do the join. The cost won't be that different. The clustering will
go faster for not having to schlep around the payload which will probably
more than compensate for having to read the file to join later. Many
processes will preserve order so the final join can be done as a map-side
merge. Where the map-side merge isn't possible, you may have to do a
full reduce-side join, but that is still going to be close to break-even.
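To make the "do the join afterwards" idea concrete, here is a minimal sketch of the merge step, assuming both the clustering output and the payload file are sorted by the same unique integer key. It is plain Java rather than Hadoop code, and MergeJoinSketch, Joined and mergeJoin are hypothetical names used only for illustration. In a real job the map-side merge would more likely go through something like Hadoop's CompositeInputFormat.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Hadoop code. Both inputs are assumed to be
// sorted ascending by the same unique integer key, as they would be if the
// clustering job preserved the order of its input.
class MergeJoinSketch {

    /** One joined row: the key, the assigned cluster, and the payload. */
    static final class Joined {
        final int key;
        final int clusterId;
        final String payload;

        Joined(int key, int clusterId, String payload) {
            this.key = key;
            this.clusterId = clusterId;
            this.payload = payload;
        }
    }

    /**
     * Single pass over both sorted inputs. Keys present on only one side
     * are skipped (inner-join semantics).
     */
    static List<Joined> mergeJoin(int[] clusterKeys, int[] clusterIds,
                                  int[] payloadKeys, String[] payloads) {
        List<Joined> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < clusterKeys.length && j < payloadKeys.length) {
            if (clusterKeys[i] < payloadKeys[j]) {
                i++;                      // no payload for this vector
            } else if (clusterKeys[i] > payloadKeys[j]) {
                j++;                      // no cluster for this payload
            } else {
                out.add(new Joined(clusterKeys[i], clusterIds[i], payloads[j]));
                i++;
                j++;
            }
        }
        return out;
    }
}
```

Because neither side is ever buffered or re-sorted, the cost is one sequential read of each file, which is the reason the join is expected to be close to break-even.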
On Tue, Oct 12, 2010 at 3:51 PM, Sean Owen <[email protected]> wrote:
If that's all that's meant -- seems like you just want to write
VectorAndThingWritable rather than inject an optional Thing into
VectorWritable. It'd work either way but seems cleaner to compose it that
way. VectorAndThingWritable might belong in core depending on how general
"Thing" is.
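A rough sketch of the composition idea, under loud assumptions: a real VectorAndThingWritable would implement Hadoop's Writable interface and delegate to Mahout's VectorWritable, whereas the class below is self-contained, uses a double[] as a stand-in for the vector and a String as a stand-in for "Thing", and only mirrors the write(DataOutput) / readFields(DataInput) method pair. VectorAndThingSketch and roundTrip are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for VectorAndThingWritable. It composes a vector and
// a payload instead of injecting an optional payload into the vector type.
class VectorAndThingSketch {
    private double[] vector = new double[0]; // stand-in for a Mahout Vector
    private String thing = "";               // stand-in for the payload

    VectorAndThingSketch() {
    }

    VectorAndThingSketch(double[] vector, String thing) {
        this.vector = vector;
        this.thing = thing;
    }

    double[] getVector() {
        return vector;
    }

    String getThing() {
        return thing;
    }

    // Same shape as Writable.write: vector first, then the payload.
    public void write(DataOutput out) throws IOException {
        out.writeInt(vector.length);
        for (double v : vector) {
            out.writeDouble(v);
        }
        out.writeUTF(thing);
    }

    // Same shape as Writable.readFields: read back in the order written.
    public void readFields(DataInput in) throws IOException {
        int n = in.readInt();
        vector = new double[n];
        for (int i = 0; i < n; i++) {
            vector[i] = in.readDouble();
        }
        thing = in.readUTF();
    }

    // Serializes to a byte array and reads it back, for demonstration only.
    static VectorAndThingSketch roundTrip(VectorAndThingSketch src) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            src.write(new DataOutputStream(bytes));
            VectorAndThingSketch copy = new VectorAndThingSketch();
            copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the payload in a separate wrapper means code that only understands plain vectors is untouched, and the payload type can vary from job to job.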
On Tue, Oct 12, 2010 at 10:41 PM, Ted Dunning <[email protected]> wrote:
There is currently no provision for a payload in the VectorWritable. It is
plausible that such a capability could be added.