On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning <[email protected]> wrote:

> Yes.  Hashing vector encoders will preserve distances when used with
> multiple probes.
>

So if a token occurs twice in a document, the first occurrence will be mapped
to a given location, and when the token is hashed the second time it will be
mapped to a different location, right?

I am wondering whether, with enough probes and a large enough vector, this
process mimics TF weighting, since documents with a high TF for a given
token will have the same positions marked in the vector. As Suneel said,
if we then use the Hamming distance, vectors that are close to each
other should end up in the same cluster.
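To make the question concrete, here is a minimal, invented sketch of multi-probe hashed encoding (plain Java, not Mahout's actual encoder, which uses MurmurHash and its own probing scheme). In a deterministic scheme like this, every occurrence of a token lands on the same NUM_PROBES positions, so repeated occurrences accumulate weight there, which is exactly the TF-like effect:

```java
// Hypothetical sketch of a multi-probe hashing encoder. Each token is
// hashed with NUM_PROBES different seeds; every occurrence adds 1.0 at
// the same NUM_PROBES positions, so token frequency accumulates.
public class MultiProbeSketch {
    static final int DIM = 1 << 10;   // vector size
    static final int NUM_PROBES = 3;  // probes per token

    // Simple seeded hash, purely illustrative (Mahout uses MurmurHash).
    static int probe(String token, int seed) {
        return Math.floorMod(token.hashCode() * 31 + seed * 0x9E3779B9, DIM);
    }

    static double[] encode(String[] tokens) {
        double[] v = new double[DIM];
        for (String t : tokens) {
            for (int p = 0; p < NUM_PROBES; p++) {
                v[probe(t, p)] += 1.0;  // same token -> same positions
            }
        }
        return v;
    }
}
```

Encoding "mahout" twice produces exactly twice the weight at the same positions as encoding it once, never new positions.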


>
> Interpretation becomes somewhat difficult, but there is code available to
> reverse engineer labels on hashed vectors.


I saw that AdaptiveWordValueEncoder has a built-in dictionary, so I can see
which words it has seen, but I don't see how to go from a position (or
several positions) in the vector back to labels. Is there an example in the
code I can look at?
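The general reverse-lookup technique (this is a hypothetical sketch of the idea, not a pointer to specific Mahout code) is to exploit the fact that hashing is deterministic: re-hash every word in a known vocabulary, record which positions each word touches, and then map a heavy position in a cluster centroid back to its candidate words:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: build a position -> candidate-words index by
// re-hashing a known vocabulary with the same probe function the
// encoder used. Constants and probe() mirror the encoder sketch above.
public class PositionToLabel {
    static final int DIM = 1 << 10;
    static final int NUM_PROBES = 3;

    static int probe(String token, int seed) {
        return Math.floorMod(token.hashCode() * 31 + seed * 0x9E3779B9, DIM);
    }

    // For each vocabulary word, record every position its probes touch.
    static Map<Integer, List<String>> buildIndex(Iterable<String> vocabulary) {
        Map<Integer, List<String>> index = new HashMap<>();
        for (String w : vocabulary) {
            for (int p = 0; p < NUM_PROBES; p++) {
                index.computeIfAbsent(probe(w, p), k -> new ArrayList<>()).add(w);
            }
        }
        return index;
    }
}
```

A position can of course collide across words, so the lookup yields candidates rather than a unique label; requiring a word to be heavy at several of its probe positions narrows that down.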


> IDF weighting is slightly tricky, but quite doable if you keep a dictionary
> of, say, the most common 50-200 thousand words and assume all others have
> constant and equal frequency.
>

How would IDF weighting work in conjunction with hashing? First build up a
dictionary of the 50-200 thousand most common words and pass that into the
vector encoders? The drawback is that you have another pass through the data
and another 'input' to keep track of and configure. But maybe it has to be
like that. The reason I like the hashed encoders is that vectorizing can be
done in a streaming manner, at the last possible moment. With the current
tools you have to do: data -> data2seq -> seq2sparse -> kmeans.
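A minimal sketch of how the two could combine, assuming the dictionary of common words and their document frequencies is built in a separate pass (or taken from a reference corpus), with every out-of-dictionary word getting a single constant weight as Ted suggests. All names here are invented for illustration, not Mahout API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of IDF-weighted hashed encoding: known words get
// log(N/df), everything else a constant default weight, and tokens are
// hashed and weighted in one streaming pass.
public class IdfHashedEncoder {
    static final int DIM = 1 << 10;
    static final int NUM_PROBES = 3;
    static final double DEFAULT_IDF = 10.0;  // assumed weight for unseen words

    final Map<String, Double> idf = new HashMap<>();

    IdfHashedEncoder(Map<String, Integer> docFreq, int numDocs) {
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            idf.put(e.getKey(), Math.log((double) numDocs / e.getValue()));
        }
    }

    static int probe(String token, int seed) {
        return Math.floorMod(token.hashCode() * 31 + seed * 0x9E3779B9, DIM);
    }

    // Streaming: each token is weighted and hashed as it arrives.
    double[] encode(Iterable<String> tokens) {
        double[] v = new double[DIM];
        for (String t : tokens) {
            double w = idf.getOrDefault(t, DEFAULT_IDF);
            for (int p = 0; p < NUM_PROBES; p++) {
                v[probe(t, p)] += w;
            }
        }
        return v;
    }
}
```

The dictionary is a second input to manage, but it is small, static, and only needed at encode time, so the vectorize-at-the-last-moment property survives.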

If this approach is doable I would like to code up a non-Hadoop Java example
on the Reuters dataset: vectorize each doc with the hashing encoders, run
KMeans with a Hamming distance measure, and then write some code to get the
labels.
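The distance part of that example is small; here is a standalone sketch of a Hamming distance over binarized hashed vectors (in Mahout proper this would implement the DistanceMeasure interface; this version just counts positions where exactly one of the two vectors is non-zero):

```java
// Minimal sketch of a Hamming distance for binarized hashed vectors:
// the distance is the number of positions where one vector is non-zero
// and the other is zero.
public class HammingDistance {
    static double distance(double[] a, double[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++) {
            if ((a[i] != 0) != (b[i] != 0)) d++;
        }
        return d;
    }
}
```

Note this throws away the accumulated TF-like weights by binarizing; whether that loses too much versus, say, cosine distance on the weighted vectors is something the Reuters experiment could measure.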

Cheers,

Frank


>
>
>
> On Tue, Mar 18, 2014 at 2:40 PM, Frank Scholten <[email protected]> wrote:
>
> > Hi all,
> >
> > Would it be possible to use hashing vector encoders for text clustering
> > just like when classifying?
> >
> > Currently we vectorize using a dictionary where we map each token to a
> > fixed position in the dictionary. After the clustering we have to
> > retrieve the dictionary to determine the cluster labels. This is quite a
> > complex process, with multiple outputs read and written over the course
> > of the clustering.
> >
> > I think it would be great if both algorithms could use the same encoding
> > process but I don't know if this is possible.
> >
> > The problem is that we lose the mapping between token and position when
> > hashing. We need this mapping to determine cluster labels.
> >
> > However, maybe we could make it so hashed encoders can be used and
> > determining the top labels is left to the user. This might be a
> > possibility because I noticed a problem with the current cluster
> > labeling code. This is what happens: first documents are vectorized
> > with TF-IDF and clustered. Then the labels are ranked, but again
> > according to TF-IDF instead of TF. So it is possible for a token to
> > become the top-ranked label even though it is rare within the cluster;
> > the document with that token is in the cluster because of other tokens.
> > If the labels were determined by a TF score within the cluster I think
> > you would get better labels, but that requires a post-processing step
> > over your original data to do a TF count.
> >
> > Thoughts?
> >
> > Cheers,
> >
> > Frank
> >
>
