Hi Frank, no, no collocation job. You just take a big enough sample of documents and assign each one to its cluster with the learned ClusterClassifier. In parallel you count the total words in a Guava Multiset and the per-cluster word counts in one Multiset per cluster. The LogLikelihood class contains a convenient method that takes two Multisets, which you call once for every cluster.
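To make that concrete, here is a minimal sketch of the label-finding step. In Mahout you would hand the two Multisets to LogLikelihood.compareFrequencies; the code below instead re-implements the underlying 2x2 log-likelihood ratio with plain HashMaps standing in for the Guava Multisets, so it runs without any extra jars. The class name, helper names, and the word counts in main are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of per-cluster label finding: count words globally and per cluster,
// then score each word with the log-likelihood ratio (the statistic behind
// Mahout's LogLikelihood.compareFrequencies). Plain HashMaps stand in for
// Guava Multisets so the example needs no external dependencies.
public class ClusterLabelSketch {

    static double xLogX(double x) {
        return x == 0.0 ? 0.0 : x * Math.log(x);
    }

    // Entropy-style term used by the LLR formula.
    static double entropy(long... counts) {
        double sum = 0.0;
        double result = 0.0;
        for (long c : counts) {
            result += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - result;
    }

    // 2x2 contingency-table log-likelihood ratio (G^2).
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        if (rowEntropy + columnEntropy < matrixEntropy) {
            return 0.0; // guard against tiny negative values from rounding
        }
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    // Score one word: k11 = word in cluster, k12 = other words in cluster,
    // k21 = word outside cluster, k22 = other words outside cluster.
    static double score(String word, Map<String, Long> cluster, Map<String, Long> total) {
        long clusterSize = cluster.values().stream().mapToLong(Long::longValue).sum();
        long totalSize = total.values().stream().mapToLong(Long::longValue).sum();
        long k11 = cluster.getOrDefault(word, 0L);
        long k12 = clusterSize - k11;
        long k21 = total.getOrDefault(word, 0L) - k11;
        long k22 = (totalSize - clusterSize) - k21;
        return logLikelihoodRatio(k11, k12, k21, k22);
    }

    public static void main(String[] args) {
        // Made-up counts: "soccer" is concentrated in this cluster, so it
        // should get the highest score and serve as a cluster label.
        Map<String, Long> total = new HashMap<>();
        total.put("the", 1000L);
        total.put("soccer", 60L);
        total.put("bank", 55L);

        Map<String, Long> sportsCluster = new HashMap<>();
        sportsCluster.put("the", 100L);
        sportsCluster.put("soccer", 50L);
        sportsCluster.put("bank", 2L);

        for (String w : total.keySet()) {
            System.out.printf("%s: %.1f%n", w, score(w, sportsCluster, total));
        }
    }
}
```

With real data you would stream the sampled documents, classify each one, and add its tokens to the global Multiset and to its cluster's Multiset before scoring.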
There should be no need to start a MapReduce job for that; with some RAM you can just stream the documents from HDFS.

On Fri, Mar 21, 2014 at 5:29 PM, Frank Scholten <[email protected]> wrote:

> Hi Johannes,
>
> Sounds good.
>
> The step for finding labels is still unclear to me. You use the
> LogLikelihood class on the original documents? How? Or do you mean the
> collocation job?
>
> Cheers,
>
> Frank
>
> On Thu, Mar 20, 2014 at 8:39 PM, Johannes Schulte <
> [email protected]> wrote:
>
> > Hi Frank, we are using a very similar system in production:
> > hashing text-like data to a 50,000-dimensional vector with two
> > probes, and then applying tf-idf weighting.
> >
> > For IDF we don't keep a separate weight dictionary but just count
> > the distinct training examples ("documents") that have a non-null
> > value per column, so there is a full IDF vector that can be used.
> > Instead of Euclidean distance we use cosine (for performance
> > reasons).
> >
> > The results are very good; building such a system is easy and maybe
> > it's worth a try.
> >
> > For representing the clusters we have a separate job that assigns
> > users ("documents") to clusters and shows the most discriminating
> > words for each cluster via the LogLikelihood class. The results are
> > then visualized using http://wordcram.org/ for the whoah effect.
> >
> > Cheers,
> >
> > Johannes
> >
> > On Wed, Mar 19, 2014 at 8:35 PM, Ted Dunning <[email protected]> wrote:
> >
> > > On Wed, Mar 19, 2014 at 11:34 AM, Frank Scholten <[email protected]> wrote:
> > >
> > > > On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning <[email protected]> wrote:
> > > >
> > > > > Yes. Hashing vector encoders will preserve distances when used
> > > > > with multiple probes.
> > > >
> > > > So if a token occurs two times in a document, the first token
> > > > will be mapped to a given location, and when the token is hashed
> > > > the second time it will be mapped to a different location, right?
> > >
> > > No. The same token will always hash to the same location(s).
> > >
> > > > I am wondering if, when many probes are used and a large enough
> > > > vector, this process mimics TF weighting, since documents that
> > > > have a high TF of a given token will have the same positions
> > > > marked in the vector. As Suneel said, when we then use the
> > > > Hamming distance, the vectors that are close to each other
> > > > should be in the same cluster.
> > >
> > > Hamming distance doesn't quite work because you want collisions to
> > > sum rather than OR. Also, if you apply weights to the words, these
> > > weights will be added to all of the probe locations for the words.
> > > This means we still need a plus/times/L2 dot product rather than a
> > > plus/AND/L1 dot product like the Hamming distance uses.
> > >
> > > > > Interpretation becomes somewhat difficult, but there is code
> > > > > available to reverse engineer labels on hashed vectors.
> > > >
> > > > I saw that AdaptiveWordEncoder has a built-in dictionary so I can
> > > > see which words it has seen, but I don't see how to go from a
> > > > position or several positions in the vector to labels. Is there
> > > > an example in the code I can look at?
> > >
> > > Yes. The newsgroups example applies.
> > >
> > > The AdaptiveWordEncoder counts the word occurrences that it sees
> > > and uses the IDF based on the resulting counts. This assumes that
> > > all instances of the AWE will see roughly the same distribution of
> > > words in order to work. It is fine for lots of applications and
> > > not fine for lots of others.
> > > > >
> > > > > IDF weighting is slightly tricky, but quite doable if you keep
> > > > > a dictionary of, say, the most common 50-200 thousand words
> > > > > and assume all others have constant and equal frequency.
> > > >
> > > > How would IDF weighting work in conjunction with hashing? First
> > > > build up a dictionary of 50-200 thousand words and pass that
> > > > into the vector encoders? The drawback of this is that you have
> > > > another pass through the data and another 'input' to keep track
> > > > of and configure. But maybe it has to be like that.
> > >
> > > With hashing, you still have the option of applying a weight to
> > > the hashed representation of each word. The question is what
> > > weight.
> > >
> > > To build a small dictionary, you don't have to go through all of
> > > the data. Just enough to get reasonably accurate weights for most
> > > words. All words not yet seen can be assumed to be rare and thus
> > > get the nominal "rare-word" weight.
> > >
> > > Keeping track of the dictionary of weights is, indeed, a pain.
> > >
> > > > The reason I like the hashed encoders is that vectorizing can be
> > > > done in a streaming manner at the last possible moment. With the
> > > > current tools you have to do: data -> data2seq -> seq2sparse ->
> > > > kmeans.
> > >
> > > Indeed. That is the great virtue.
> > >
> > > > If this approach is doable I would like to code up a Java,
> > > > non-Hadoop example using the Reuters dataset which vectorizes
> > > > each doc using the hashing encoders, configures KMeans with
> > > > Hamming distance, and then writes some code to get the labels.
> > >
> > > Use Euclidean distance, not Hamming.
> > >
> > > You can definitely use the AWE here if you randomize document
> > > ordering.
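Ted's point about probes and summed collisions can be illustrated with a tiny hashed encoder. This is a simplified sketch, not Mahout's FeatureVectorEncoder API: probe locations are derived by mixing the probe index into the word's hash, and repeated tokens add into those same slots, which is why a plus/times/L2 dot product works where a Hamming-style OR loses information. All names and constants here are illustrative.

```java
import java.util.Objects;

// Minimal sketch of a hashing vector encoder with multiple probes.
// Each word deterministically maps to PROBES slots; repeated occurrences
// add into those same slots, so counts (and any per-word weights) sum
// instead of OR-ing together.
public class HashedEncoderSketch {
    static final int DIM = 50_000; // vector size, as in the thread
    static final int PROBES = 2;

    // Mix the probe index into the hash so each probe gets its own slot.
    static int slot(String word, int probe) {
        return Math.floorMod(Objects.hash(word, probe), DIM);
    }

    static double[] encode(String[] tokens, double weight) {
        double[] v = new double[DIM];
        for (String t : tokens) {
            for (int p = 0; p < PROBES; p++) {
                v[slot(t, p)] += weight; // sum, not OR
            }
        }
        return v;
    }

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            s += a[i] * b[i];
        }
        return s;
    }

    public static void main(String[] args) {
        double[] once = encode(new String[] {"soccer"}, 1.0);
        double[] twice = encode(new String[] {"soccer", "soccer"}, 1.0);
        // The same token always hashes to the same locations, so a second
        // occurrence doubles those entries rather than OR-ing them away,
        // and the L2 dot product reflects that.
        System.out.println(dot(once, once));
        System.out.println(dot(once, twice));
    }
}
```

A Hamming/OR representation would make `once` and `twice` indistinguishable; with summed slots the dot product against `twice` is exactly double, which is the TF information the distance measure needs.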
