OK, then what would you consider the minimum number of columns in the
dataset for Lanczos to give a reasonable result?

Thanks,
-Trevor

> A gazillion rows of 2-columned data is really much better suited to doing
> the following:
>
> if each row is of the form [a, b], then compute the matrix
>
> [[a*a, a*b], [a*b, b*b]]
>
> (the outer product of the vector with itself)
>
> Then take the matrix sum of all of these, from each row of your input
> matrix.
>
> You'll now have a 2x2 matrix, which you can diagonalize by hand.  It will
> give you your eigenvalues, and also the right-singular vectors of your
> original matrix.
>
>   -jake
>
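A minimal sketch of the 2x2 approach described above, in plain Java with no
Mahout calls. The class and method names (TwoColumnEigen, gram,
eigenSymmetric2x2) are made up for illustration, and it assumes the rows fit
in a double[][]; for something like 50M rows the same three running sums
could be accumulated while streaming. Note that the eigenvalues computed here
are those of A^T A (the squared singular values of the input matrix A), and
no mean-centering is applied.

```java
// Hypothetical helper, not part of Mahout: eigen-decomposition for 2-column data.
final class TwoColumnEigen {

    // Sums the outer product of each row [a, b] with itself,
    // which yields the symmetric 2x2 matrix A^T A.
    public static double[][] gram(double[][] rows) {
        double aa = 0.0, ab = 0.0, bb = 0.0;
        for (double[] row : rows) {
            double a = row[0];
            double b = row[1];
            aa += a * a;
            ab += a * b;
            bb += b * b;
        }
        return new double[][] {{aa, ab}, {ab, bb}};
    }

    // Closed-form eigen-decomposition of a symmetric 2x2 matrix [[p, q], [q, r]].
    // Returns {{lambda1, x1, y1}, {lambda2, x2, y2}} with lambda1 >= lambda2
    // and unit-length eigenvectors (x, y).
    public static double[][] eigenSymmetric2x2(double[][] m) {
        double p = m[0][0];
        double q = m[0][1];
        double r = m[1][1];

        // Roots of the characteristic polynomial: mean +/- radius.
        double mean = (p + r) / 2.0;
        double radius = Math.hypot((p - r) / 2.0, q);
        double lambda1 = mean + radius;
        double lambda2 = mean - radius;

        // Eigenvector for lambda1: (lambda1 - r, q) when q != 0,
        // otherwise the matrix is already diagonal and the axes are the eigenvectors.
        double x;
        double y;
        if (q != 0.0) {
            x = lambda1 - r;
            y = q;
        } else if (p >= r) {
            x = 1.0;
            y = 0.0;
        } else {
            x = 0.0;
            y = 1.0;
        }
        double norm = Math.hypot(x, y);
        x /= norm;
        y /= norm;

        // For a symmetric matrix the second eigenvector is orthogonal to the first.
        return new double[][] {{lambda1, x, y}, {lambda2, -y, x}};
    }

    public static void main(String[] args) {
        double[][] rows = {{1.0, 2.0}, {2.0, 1.0}, {3.0, 3.5}};
        for (double[] pair : eigenSymmetric2x2(gram(rows))) {
            System.out.printf("eigenvalue %f, vector <%f, %f>%n", pair[0], pair[1], pair[2]);
        }
    }
}
```

The results come back largest eigenvalue first, and a zero eigenvalue only
shows up when the two columns are linearly dependent.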
> 2011/6/23 <[email protected]>
>
>> Yes, that is exactly why I asked it for only 2 eigenvalues. So what is
>> being said is that if I have, let's say, 50M rows of 2-column data,
>> Lanczos can't do anything with it (assuming it puts the 0 eigenvalue in
>> the mix: of the 2 eigenvectors, only 1 is returned because the 0
>> eigenvalue takes up a slot)?
>>
>> If the eigenvalue of 0 is invalid, then should it not be filtered out so
>> that it returns "rank" number of eigenvalues that could be valid?
>>
>> -Trevor
>>
>> > Ah, if your matrix only has 2 columns, you can't go to rank 10.  Try
>> > on some slightly less synthetic data of more than rank 10.  You can't
>> > ask Lanczos for a reduced rank greater than the rank of the matrix
>> > itself.
>> >
>> >   -jake
>> >
>> > 2011/6/23 <[email protected]>
>> >
>> >> Alright, I can reorder; that is easy. I just had to verify that the
>> >> ordering was correct. When I increase the rank of the results,
>> >> Lanczos bails out, which incidentally causes a NullPointerException:
>> >>
>> >> INFO: 9 passes through the corpus so far...
>> >> WARNING: Lanczos parameters out of range: alpha = NaN, beta = NaN. Bailing out early!
>> >> INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal auxiliary matrix.
>> >> Exception in thread "main" java.lang.NullPointerException
>> >>        at org.apache.mahout.math.DenseVector.assign(DenseVector.java:133)
>> >>        at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:160)
>> >>        at pca.PCASolver.solve(PCASolver.java:53)
>> >>        at pca.PCA.main(PCA.java:20)
>> >>
>> >> I should probably note that my data only has 2 columns; the real data
>> >> will have quite a bit more.
>> >>
>> >> The failure happens with a rank of 10 or more, with the last, and
>> >> therefore most significant, eigenvector being <NaN,NaN>.
>> >>
>> >> -Trevor
>> >> > The 0 eigenvalue output is not valid, and yes, the output will list
>> >> > the results in *increasing* order, even though it is finding the
>> >> > largest eigenvalues/vectors first.
>> >> >
>> >> > Remember that convergence is gradual, so if you only ask for 3
>> >> > eigenvectors/values, you won't be very accurate.  If you ask for 10
>> >> > or more, the largest few will now be quite good.  If you ask for 50,
>> >> > now the top 10-20 will be *extremely* accurate, and maybe the top 30
>> >> > will still be quite good.
>> >> >
>> >> > Try out a non-distributed form of what is in the EigenVerificationJob
>> >> > to re-order the output and check how accurate your results are (it
>> >> > computes errors for you as well).
>> >> >
>> >> >   -jake
>> >> >
>> >> > 2011/6/23 <[email protected]>
>> >> >
>> >> >> So, I know that MAHOUT-369 fixed a bug with the distributed version
>> >> >> of the LanczosSolver, but I am experiencing a similar problem with
>> >> >> the non-distributed version.
>> >> >>
>> >> >> I send in a dataset of Gaussian-distributed numbers (testing PCA
>> >> >> stuff) and my eigenvalues are seemingly reversed. Below is the
>> >> >> output given in the logs from LanczosSolver.
>> >> >>
>> >> >> Output:
>> >> >> INFO: Eigenvector 0 found with eigenvalue 0.0
>> >> >> INFO: Eigenvector 1 found with eigenvalue 347.8703086831804
>> >> >> INFO: LanczosSolver finished.
>> >> >>
>> >> >> So it returns a vector with eigenvalue 0 before one with an
>> >> >> eigenvalue of 347? What's more interesting is that when I increase
>> >> >> the rank, I get a new eigenvector with a value between 0 and 347:
>> >> >>
>> >> >> INFO: Eigenvector 0 found with eigenvalue 0.0
>> >> >> INFO: Eigenvector 1 found with eigenvalue 44.794928654801566
>> >> >> INFO: Eigenvector 2 found with eigenvalue 347.8286920203704
>> >> >>
>> >> >> Shouldn't the eigenvalues be in descending order? Also, is the 0.0
>> >> >> eigenvalue even valid?
>> >> >>
>> >> >> Thanks,
>> >> >> Trevor
>> >> >>
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >
>>
>>
>>
>

