Hi Tino,

See my responses inline...
On Thu, May 7, 2009 at 4:08 AM, Javier Sanchez Monzon (Tino) <[email protected]> wrote:
>
> Hi everybody,
>
> I have some questions referring to the SenseClusters tools. I hope there
> are not too many.
>
> 0 - Maybe the main question is the following:
>
> Is it possible to have an output like the following? It doesn't matter to
> me which documents the words come from. I am only interested in how the
> words are related before the clustering and after it. Is this solution
> possible?
>
> cluster0
> ----------
> word--(0.82)--word2
> word--(0.81)--word3
> word3--(0.72)--word2
> .....
>
> cluster1
> ----------
> word4--(0.82)--word9
> word6--(0.81)--word5
> word37--(0.72)--word6
> ......

I don't know if the output will be exactly as you describe, but you can do
word clustering using the --wordclust option in discriminate.pl (or by
checking word clustering in the web interface).

http://search.cpan.org/~tpederse/Text-SenseClusters-1.01/discriminate.pl#--wordclust

> 1 - I am looking to determine scored relations between nouns and proper
> names with the SenseClusters tools. I achieved this by using count.pl,
> combig.pl and then statistics.pl. I tested the last only with the default
> association measure, the maximum likelihood ratio. Is the Fisher measure
> better if I am interested in finding the best co-occurrences of the text
> corpus? This solution is without any clustering process.

In general there is no single best measure for identifying collocations -
each of the different measures behaves a bit differently, and the best
thing to do is to experiment a bit and see which measure behaves in the way
that best suits your application. Fisher's test in general tends to find
that quite a few pairs of words are collocations (so it might be thought of
as a high-recall approach, which can be helpful in some settings).

> 2 - I did some experiments with the count.pl, combig.pl, statistics.pl
> (log-likelihood ratio), wordvec.pl and vcluster (given a number of
> clusters) programs.
> With the report of clustering I ask it to add the frequent item sets of
> each cluster. How is this calculation of frequent itemsets done? Are these
> words the most frequent words of the cluster that appear together in the
> documents before clustering?

I think these are based on the frequency in the cluster (although I'm not
entirely sure which output you are referring to here, so if you could send
some sample output that would help).

> 3 - About describing and discriminating features. Let's say I ask for the
> best 5 features for each cluster.
>
> cluster 1
> -----------
> Describing features (features that can appear in other clusters?): tv 40%
> magazin 30% show 29% stage 27% crowd 25%
> Discriminating features (these features only appear in this cluster?):
> ...............
>
> Is it possible then to infer that tv and magazin are related, and to have
> something like:
> word--(0.82)--word2
> word--(0.81)--word3
> word3--(0.72)--word2

I don't think you can infer too much about the relationship between tv and
magazine. What you can infer is that both tv and magazine occur more often
than you'd expect by chance in that cluster, and so they might tell you
something about the contents of that cluster.

> 4 - I understood that using count.pl, combig.pl, statistics.pl, wordvec.pl
> and vcluster gives a hard clustering solution. Which other combination or
> setup should I try in order to obtain a soft clustering solution? For
> example, having some words repeated in more than one cluster? Consider for
> example the following solution:
>
> cluster 0
> -----------
> word1 word2 word3
>
> cluster 1
> -----------
> word1 word4 word2
>
> How can I achieve this? Would scluster do this?

When you are doing word clustering (--wordclust) each word will appear in
just one cluster (so it's a hard clustering solution). scluster refers to
similarity matrix clustering, and vcluster refers to vector clustering...
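To make the difference between the association measures in question 1 a bit
more concrete, here is a small Python sketch (SenseClusters and NSP are
written in Perl; this is only an illustration, not the actual statistic.pl
code). It assumes the usual 2x2 contingency-table layout for a word pair:
n11 is the joint count, n1p and np1 the marginal counts, and npp the total
sample size.

```python
import math

def loglikelihood(n11, n1p, np1, npp):
    """G^2 log-likelihood ratio for a 2x2 contingency table.
    n11 = joint count, n1p/np1 = marginal counts, npp = sample size.
    Larger scores mean the pair co-occurs more than chance predicts."""
    obs = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    exp = [n1p * np1 / npp,
           n1p * (npp - np1) / npp,
           (npp - n1p) * np1 / npp,
           (npp - n1p) * (npp - np1) / npp]
    # Terms with an observed count of zero contribute nothing.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)

def fisher_right(n11, n1p, np1, npp):
    """Right-tailed Fisher's exact test: the probability of seeing n11
    or more joint occurrences by chance (a hypergeometric tail sum).
    Smaller p-values mean stronger evidence the pair is a collocation."""
    total = math.comb(npp, np1)
    return sum(math.comb(n1p, k) * math.comb(npp - n1p, np1 - k)
               for k in range(n11, min(n1p, np1) + 1)) / total
```

For example, with npp = 100 and marginals of 10 each, a joint count of 8 is
far above the expected value of 1, so the log-likelihood score is large and
the Fisher tail probability is tiny; at a joint count of exactly 1 (the
chance level) the log-likelihood score drops to 0. Running both measures
over the same count data and comparing their rankings is one practical way
to do the experimentation suggested above.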
> 5 - When I use the clusterstopping.pl program it suggests in most cases
> (using the default stopping measure, PK3) what is in my opinion a small
> number of clusters. When I cluster with a number that is two times greater
> than the suggested one, I get, as expected, a more precise cluster
> partition. My question here is: which other cluster stopping measure
> should I try?

I would suggest trying the measure PK2. In general that seems to perform
pretty well. Also, the cluster stopping algorithm really depends very much
on the features you are using, so you might want to experiment with
whatever features you are using and how you identify them.

> Regards,
> Tino
>
> P.S.: Congratulations to Dr. Ted Pedersen on his promotion to associate
> professor.

Thank you! I hope this all helps. Please let us know if additional
questions arise.

Good luck,
Ted

> ------------------------------------------------------------------------------
> The NEW KODAK i700 Series Scanners deliver under ANY circumstances! Your
> production scanning environment may not be a perfect world - but thanks to
> Kodak, there's a perfect scanner to get the job done! With the NEW KODAK
> i700 Series Scanner you'll get full speed at 300 dpi even with all image
> processing features enabled. http://p.sf.net/sfu/kodak-com
> _______________________________________________
> senseclusters-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/senseclusters-users

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
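P.S. In case it helps with question 5, here is a rough Python sketch of the
idea behind PK2 (the function name and the simplified stopping rule below
are mine, not the actual clusterstopping.pl implementation, which is in
Perl). PK2 compares the clustering criterion function at k clusters to its
value at k-1 clusters; once that ratio settles near 1, adding further
clusters is no longer improving the criterion, which is roughly where
cluster stopping predicts the number of clusters.

```python
def pk2_scores(crfun):
    """Given criterion-function values crfun[0], crfun[1], ... obtained
    for k = 1, 2, ... clusters, return the PK2 ratio
    crfun(k) / crfun(k-1) for each k >= 2.  Scores near 1 suggest the
    k-th cluster bought little improvement."""
    return [crfun[i] / crfun[i - 1] for i in range(1, len(crfun))]
```

For example, criterion values of 10, 18, 20, 20.5 give PK2 scores of
1.8, ~1.11, ~1.02: the big jump happens going from 1 to 2 clusters, and the
flattening ratios afterwards are the signal that a small k may already be
adequate - which is consistent with the small suggestions you are seeing.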
