Hi Tino,

See my responses inline...
On Thu, May 7, 2009 at 4:08 AM, Javier Sanchez Monzon (Tino) <[email protected]> wrote:
>
> Hi everybody,
>
> I have some questions referring to the SenseClusters tools. I hope there
> are not too many.
>
> 0 - Maybe the main question is the following:
>
> Is it possible to have an output like the following? It doesn't matter to
> me which documents the words come from. I am only interested in how the
> words are related before the clustering and after it. Is this solution
> possible?
>
> cluster0
> ----------
> word--(0.82)--word2
> word--(0.81)--word3
> word3--(0.72)--word2
> .....
>
> cluster1
> ----------
> word4--(0.82)--word9
> word6--(0.81)--word5
> word37--(0.72)--word6
> ......

I don't know if the output will be exactly as you describe, but you can do
word clustering using the --wordclust option in discriminate.pl (or by
checking word clustering in the web interface).

http://search.cpan.org/~tpederse/Text-SenseClusters-1.01/discriminate.pl#--wordclust

> 1 - I am looking to determine scored relations between nouns and proper
> names with the SenseClusters tools. I achieved this by using count.pl,
> combig.pl and then statistics.pl. I tested the last only with the default
> association measure, the maximum likelihood ratio. Is the Fisher measure
> better if I am interested in finding the best co-occurrences of the text
> corpus? This solution is without any clustering process.

In general there is no single best measure for identifying collocations -
each of the different measures behaves a bit differently, and the best
thing to do is to experiment a bit and see which measure behaves in the way
that best suits your application. Fisher's test in general tends to find
that quite a few pairs of words are collocations (so it might be thought of
as a high-recall approach, which can be helpful in some settings).

> 2 - I did some experiments with the count.pl, combig.pl, statistics.pl
> (log-likelihood ratio), wordvec.pl and vcluster (given a number of
> clusters) programs.
> With the report of clustering I ask it to add the frequent item sets of
> each cluster. How is this calculation of frequent itemsets done? Are these
> words the most frequent words of the cluster that appear together in the
> documents before clustering?

I think these are based on the frequency in the cluster (although I'm not
entirely sure which output you are referring to here, so if you could send
some sample output that would help).

> 3 - About describing and discriminating features. Let's say I ask for the
> best 5 features for each cluster.
>
> cluster 1
> -----------
> Describing features (features that can appear in other clusters?): tv 40%
> magazin 30% show 29% stage 27% crowd 25%
> Discriminating features (these features only appear in this cluster?):
> ...............
>
> Is it possible then to infer that tv and magazin are related, and to have
> something like:
> word--(0.82)--word2
> word--(0.81)--word3
> word3--(0.72)--word2

I don't think you can infer too much about the relationship between tv and
magazine. What you can infer is that both tv and magazine occur more often
than you'd expect by chance in that cluster, and so they might tell you
something about the contents of that cluster.

> 4 - I understood that using count.pl, combig.pl, statistics.pl, wordvec.pl
> and vcluster gives a hard clustering solution. Which other combination or
> setup should I try in order to obtain a soft clustering solution? For
> example, having some words repeated in more than one cluster? Consider for
> example the following solution:
>
> cluster 0
> -----------
> word1 word2 word3
>
> cluster 1
> -----------
> word1 word4 word2
>
> How can I achieve this? Would scluster do this?

When you are doing word clustering (--wordclust) each word will appear in
just one cluster (so it's a hard clustering solution). scluster refers to
similarity matrix clustering, and vcluster refers to vector clustering...
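To make the difference between the association measures in question 1 a bit
more concrete, here is a small Python sketch (SenseClusters and NSP are
written in Perl; this is only an illustration, not the actual statistic.pl
code). It assumes the usual 2x2 contingency-table layout for a word pair:
n11 is the joint count, n1p and np1 the marginal counts, and npp the total
sample size.

```python
import math

def loglikelihood(n11, n1p, np1, npp):
    """G^2 log-likelihood ratio for a 2x2 contingency table.
    n11 = joint count, n1p/np1 = marginal counts, npp = sample size.
    Larger scores mean the pair co-occurs more than chance predicts."""
    obs = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    exp = [n1p * np1 / npp,
           n1p * (npp - np1) / npp,
           (npp - n1p) * np1 / npp,
           (npp - n1p) * (npp - np1) / npp]
    # Terms with an observed count of zero contribute nothing.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)

def fisher_right(n11, n1p, np1, npp):
    """Right-tailed Fisher's exact test: the probability of seeing n11
    or more joint occurrences by chance (a hypergeometric tail sum).
    Smaller p-values mean stronger evidence the pair is a collocation."""
    total = math.comb(npp, np1)
    return sum(math.comb(n1p, k) * math.comb(npp - n1p, np1 - k)
               for k in range(n11, min(n1p, np1) + 1)) / total
```

For example, with npp = 100 and marginals of 10 each, a joint count of 8 is
far above the expected value of 1, so the log-likelihood score is large and
the Fisher tail probability is tiny; at a joint count of exactly 1 (the
chance level) the log-likelihood score drops to 0. Running both measures
over the same count data and comparing their rankings is one practical way
to do the experimentation suggested above.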
> 5 - When I use the clusterstopping.pl program it suggests in most cases
> (using the default stopping measure, PK3) what is in my opinion a small
> number of clusters. When I cluster with a number that is two times greater
> than the suggested one, I get, as expected, a more precise cluster
> partition. My question here is: which other cluster stopping measure
> should I try?

I would suggest trying the measure PK2. In general that seems to perform
pretty well. Also, the cluster stopping algorithm really depends very much
on the features you are using, so you might want to experiment with
whatever features you are using and how you identify them.

> Regards,
> Tino
>
> P.S.: Congratulations to Dr. Ted Pedersen on his promotion to associate
> professor.

Thank you! I hope this all helps. Please let us know if additional
questions arise.

Good luck,
Ted

> ------------------------------------------------------------------------------
> The NEW KODAK i700 Series Scanners deliver under ANY circumstances! Your
> production scanning environment may not be a perfect world - but thanks to
> Kodak, there's a perfect scanner to get the job done! With the NEW KODAK
> i700 Series Scanner you'll get full speed at 300 dpi even with all image
> processing features enabled. http://p.sf.net/sfu/kodak-com
> _______________________________________________
> senseclusters-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/senseclusters-users

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
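P.S. In case it helps with question 5, here is a rough Python sketch of the
idea behind PK2 (the function name and the simplified stopping rule below
are mine, not the actual clusterstopping.pl implementation, which is in
Perl). PK2 compares the clustering criterion function at k clusters to its
value at k-1 clusters; once that ratio settles near 1, adding further
clusters is no longer improving the criterion, which is roughly where
cluster stopping predicts the number of clusters.

```python
def pk2_scores(crfun):
    """Given criterion-function values crfun[0], crfun[1], ... obtained
    for k = 1, 2, ... clusters, return the PK2 ratio
    crfun(k) / crfun(k-1) for each k >= 2.  Scores near 1 suggest the
    k-th cluster bought little improvement."""
    return [crfun[i] / crfun[i - 1] for i in range(1, len(crfun))]
```

For example, criterion values of 10, 18, 20, 20.5 give PK2 scores of
1.8, ~1.11, ~1.02: the big jump happens going from 1 to 2 clusters, and the
flattening ratios afterwards are the signal that a small k may already be
adequate - which is consistent with the small suggestions you are seeing.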
