On Tue, Oct 12, 2010 at 5:30 PM, Lance Norskog <[email protected]> wrote:
> This use case is doing Random Projection with "paired vectors". Look up
> 'semantic vectors' for an explanation.
>

Even so, I think that there is another way to do this by just keeping an id on each vector. In random projection, it is common to use a random matrix whose elements can be regenerated at will. This means we never have to transfer the elements of the random matrix, and it makes it possible to use pieces of the same random matrix in different places.

If you want others to comment on your detailed use case, it would help if you could explain it more fully here. I don't see any real need for a payload in my understanding of paired random indexing.

> My pipeline is three different M/R jobs in sequence with three different
> semantics for the output vectors. The payload has to be included in all
> three output sets. So I really do want a good vector I/O toolkit.
>

Copying this payload doesn't necessarily make sense, as I pointed out previously. If it does, and you don't need to pass vector + payload through normal Mahout code, then it is pretty trivial for you to devise your own writable data structure as Sean suggested. Your data structure can include a VectorWritable along with anything else you like.

> p.s. If you understand the math of why 2 flat-distribution random numbers
> added create a pyramidal distribution, please write. I'm attempting to
> reverse this effect. [email protected]
>

This is a consequence of the central limit theorem. The distribution of a sum of random variables drawn independently from the same base distribution with finite variance tends to a normal distribution whose variance equals the base variance multiplied by the number of elements being summed. The convergence is very quick. In fact, the sum of 12 uniform [-0.5, 0.5] deviates was often used in the dark ages (aka the golden years) of computing as a way to quickly generate a unit normal deviate.
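
For anyone who wants to see both effects numerically, here is a minimal sketch in plain Python. The helper names (sum_of_uniforms, histogram), the sample counts and the seeds are my own choices for illustration, not anything from Mahout. It histograms the sum of two uniform deviates, which should show the "pyramid", and checks that the sum of twelve has mean near 0 and variance near 1.

```python
import random

def sum_of_uniforms(n_terms, n_samples, seed=0):
    """Draw n_samples sums of n_terms independent uniform [-0.5, 0.5) deviates."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [sum(rng.random() - 0.5 for _ in range(n_terms))
            for _ in range(n_samples)]

def histogram(values, lo, hi, n_bins):
    """Count the values falling into each of n_bins equal-width bins on [lo, hi)."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v < hi:
            index = min(int((v - lo) / width), n_bins - 1)
            counts[index] += 1
    return counts

# Two terms: counts rise roughly linearly to a peak in the middle, then fall.
pair_counts = histogram(sum_of_uniforms(2, 100_000), -1.0, 1.0, 8)

# Twelve terms: each term has variance 1/12, so the sum has variance near 1.
twelve = sum_of_uniforms(12, 100_000, seed=1)
twelve_mean = sum(twelve) / len(twelve)
twelve_var = sum((x - twelve_mean) ** 2 for x in twelve) / len(twelve)

if __name__ == "__main__":
    print(pair_counts)
    print(twelve_mean, twelve_var)
```

The twelve-term sum is only an approximation to a normal deviate: it can never fall outside [-6, 6], so the extreme tails are missing.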
The cumulative distribution of such a sum is a piecewise 12th order polynomial that tracks the normal distribution very closely.

I will put up a more detailed explanation on my blog where I can draw pretty pictures and write mathematical notation, but the crux of the argument is this: if you are adding two random variables x and y, each uniform on [0, 1], then the region where there is non-zero probability is the square [0,1] x [0,1]. For a given value of x + y = z, there is a diagonal line where that value holds and x and y are in that square. Where z <= 0 or z >= 2 that intersection vanishes, and for 0 < z < 2 that intersection varies in length. The probability of the sum having some particular value z is proportional to the length of that intersection. As you can imagine, the length of the intersection varies linearly in z and reaches a maximum at z = 1.

For the sum of three random variables, we now have the intersection of a cubical region with a plane, and the probability is proportional to the area of that intersection. This takes on a more complex form than with two variables, being composed of regions with a quadratic form depending on whether we are near the ends or the middle of the cube.

As to your question of how to get rid of the non-uniformity, it is almost always a bad idea to try to eliminate this with random projections. Much better is to simply use a normal distribution instead of a uniform distribution. There are several reasons for this.

First, the sum of two normally distributed variables is also normally distributed, since the normal distribution is the fixed point for random variables under addition. This means that you don't have to worry about what the distribution of your sums will be; you already know.

Secondly, if you are dealing with random projections, then the distribution of the sum of products of random variables becomes very important. With the normal distribution, you can pretty easily determine what this distribution is.
If you started with uniform distributions, you would have a much harder time of it and would have to resort to approximation by normal distributions.

Some people think that random projections should be entirely composed of positive values. A better way to get positive values would be to use a log-linear (soft-max) link function to project R^n into the positive orthant.
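
As a rough illustration of that last suggestion, here is one way a soft-max link could look in plain Python. This is a sketch under my own assumptions: the function name and the example vector are hypothetical, and normalizing so the components sum to 1 is one possible reading of "log-linear link", not the only one.

```python
import math

def softmax(vector):
    """Map a real vector to positive components that sum to 1."""
    shift = max(vector)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in vector]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output of a random projection, with mixed signs.
projected = [0.3, -1.2, 2.0]
positive = softmax(projected)

if __name__ == "__main__":
    print(positive)
```

The ordering of the components is preserved and every output is positive for inputs of moderate size. A component far below the maximum can underflow to exactly 0.0 in floating point.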
