Hi Ted,
My apologies for not framing the question on clusterdumper properly. I am
getting the output from clusterdumper in the expected format. A sample
vector from the clusterdumper output is shown below:
1.0: /all-exchanges-strings.lc.txt = [amex:0.161, ase:0.161, asx:0.161,
biffex:0.161, bse:0.161, cboe:0.161, cbt:0.161, cme:0.161, comex:0.161,
cse:0.161, fox:0.136, fse:0.161, hkse:0.161, ipe:0.161, jse:0.161,
klce:0.161, klse:0.161, liffe:0.161, lme:0.161, lse:0.161, mase:0.161,
mise:0.161, mnse:0.161, mose:0.161, nasdaq:0.161, nyce:0.161, nycsce:0.161,
nymex:0.161, nyse:0.161, ose:0.161, pse:0.161, set:0.136, simex:0.161,
sse:0.161, stse:0.161, tose:0.161, tse:0.161, wce:0.161, zse:0.161]
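For illustration, a line in this text format can be read back into an
explicit term-to-weight mapping. This is only a minimal Python sketch that
assumes the layout of the sample above; parse_dump_line is a hypothetical
helper name, not part of Mahout, and the sample string is shortened by hand.

```python
import re

# Assumed layout of one point line: "<weight>: <name> = [term:value, ...]".
# DOTALL lets the bracketed list span several wrapped lines.
LINE_RE = re.compile(r"^\s*([0-9.]+):\s*(\S+)\s*=\s*\[(.*)\]\s*$", re.DOTALL)


def parse_dump_line(line):
    """Hypothetical helper: return (weight, name, {term: value})."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError("unrecognised line: %r" % line)
    weight, name, body = match.groups()
    features = {}
    for item in body.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the last ':' so that a term containing ':' still parses.
        term, _, value = item.rpartition(":")
        features[term] = float(value)
    return float(weight), name, features


# Shortened version of the sample vector above.
sample = "1.0: /all-exchanges-strings.lc.txt = [amex:0.161, fox:0.136, set:0.136]"
weight, name, features = parse_dump_line(sample)
```

Only the terms listed inside the brackets end up in the mapping, so the
result is already a sparse representation of that point.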
What I originally wanted to know is whether these vectors are stored just
the way clusterdumper prints them (i.e. are they dense vectors), or whether
they are sparse vectors and clusterdumper iterates over the non-zero values
and prints only those. If they are sparse vectors, can you kindly tell me
in which directory the algorithm writes the vectors so that I can read
them?
If the vectors are in dense format then I need to convert them to sparse
vectors. As can be seen from the clusterdumper output sample above, only
the features that have non-zero values are printed for each vector. The
set of features with non-zero values will differ from vector to vector.
Suppose we have 3 vectors f1, f2, f3 with sets of non-zero features s1, s2
and s3 respectively. What I want is the set
S = s1 U s2 U s3
i.e. S is the union of the sets of non-zero features of all the vectors, so
that I can convert the dense vectors to sparse vectors.
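The union described above could be computed along these lines. This is a
minimal Python sketch, assuming each vector is already available as a
{feature: value} mapping; the three vectors, their values, and the helper
names feature_union and to_indexed are made up for illustration and are not
Mahout API.

```python
# Hypothetical data: three vectors as {feature: value} mappings.
f1 = {"amex": 0.161, "nyse": 0.161}
f2 = {"nyse": 0.161, "fox": 0.136}
f3 = {"lse": 0.161}


def feature_union(vectors):
    """Return S, the union of the non-zero feature sets of all vectors."""
    union = set()
    for vec in vectors:
        union.update(term for term, value in vec.items() if value != 0.0)
    return union


S = feature_union([f1, f2, f3])

# Give every feature in S a stable integer index (alphabetical order here).
index = {term: i for i, term in enumerate(sorted(S))}


def to_indexed(vec, index):
    """Re-key a vector by integer index, keeping only non-zero entries."""
    return {index[term]: value for term, value in vec.items() if value != 0.0}


sparse_f2 = to_indexed(f2, index)
```

len(S) gives the total number of distinct features, and index gives each
vector a common coordinate system without storing any zeros.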
Your thoughts on this are welcome.
Thanks,
Ashvini
On Mon, Aug 12, 2013 at 10:55 AM, Ted Dunning <[email protected]> wrote:
> Aside from your issues with clusterdumper, the values you want can be had
> from a sparse vector using v.iterateNonZero() and v.norm(0).
>
> The issue with clusterdumper is odd.
>
> Are you saying that the display shows all the components of the vector? Or
> that there is an in-memory representation that has been densified?
>
>
>
> On Sun, Aug 11, 2013 at 9:24 PM, Ashwini P <[email protected]> wrote:
>
> > Hello,
> >
> > I am new to Mahout. I want to know how I can get the list of features
> > that were extracted from the corpus by seq2sparse and the count of the
> > total number of features.
> >
> > My problem is that when I view the clustering output using
> > clusterdumper I get only dense vectors for each point that belongs to
> > the cluster, but I want the sparse vector for each point. What I want
> > to know is whether the vectors output by the clustering algorithm are
> > stored as dense vectors or whether clusterdumper is converting them to
> > dense vectors. If the clustering algorithm generates sparse vectors I
> > can use them directly; otherwise I will have to convert the vectors
> > from dense to sparse, for which I need the information mentioned in
> > the above paragraph.
> >
> > Your suggestions on this are welcome.
> >
> > Thanks,
> > Ashvini
> >
>