On Jul 5, 2010, at 1:07 PM, Ted Dunning wrote:

> Can't say just off-hand.
>
> What is the data?

Small docs, title and description, taken from RSS feeds from 20 or so news sites. Hmm, looks like I created my docs from the wrong field (there shouldn't be stopwords like those below). Let me re-run and I'll report back.

>
> On Mon, Jul 5, 2010 at 8:20 AM, Grant Ingersoll <[email protected]> wrote:
>
>> I'm running ClusterLabels and it seems to be outputting the same values for
>> every centroid [1]. When I run the cluster dumper, the top terms are fairly
>> different for those same vectors.
>>
>> Have I hit a vagary of LLR or is this a bug?
>>
>>
>> Thanks,
>> Grant
>>
>>
>> [1]
>> <snip>
>> Top labels for Cluster 129062 containing 22710 vectors
>> Term   LLR                  In-ClusterDF  Out-ClusterDF
>> a      43269.00830466254    0             72060
>> his    7185.503760070074    0             17203
>> has    7028.243643655442    0             16855
>> from   6415.739411605988    0             15488
>> year   5930.141497239005    0             14391
>> state  5858.43069797568     0             14228
>> said   5616.422720833216    0             13676
>> it     5545.207108973991    0             13513
>> he     5239.340392438695    0             12810
>> new    4830.124521905556    0             11862
>>
>> Top labels for Cluster 129145 containing 11188 vectors
>> Term   LLR                  In-ClusterDF  Out-ClusterDF
>> a      19576.26998734614    0             72060
>> his    3352.5135342599824   0             17203
>> has    3279.466228939127    0             16855
>> from   2994.8128935270943   0             15488
>> year   2768.974903047085    0             14391
>> state  2735.612128134351    0             14228
>> said   2622.997358441353    0             13676
>> it     2589.8515553446487   0             13513
>> he     2447.4579147226177   0             12810
>> new    2256.8640938592143   0             11862
>>
>> Top labels for Cluster 129201 containing 13040 vectors
>> Term   LLR                  In-ClusterDF  Out-ClusterDF
>> a      23110.173012922285   0             72060
>> his    3940.4691014224663   0             17203
>> has    3854.554399965331    0             16855
>> from   3519.784154796507    0             15488
>> year   3254.2127395244315   0             14391
>> state  3214.9822960514575   0             14228
>> said   3082.565408431459    0             13676
>> it     3043.5924300444312   0             13513
>> he     2876.171367166564    0             12810
>> new    2652.0934832417406   0             11862
>>
>> Top labels for Cluster 129211 containing 14053 vectors
>> Term   LLR                  In-ClusterDF  Out-ClusterDF
>> a      25083.46391701023    0             72060
>> his    4266.378291217145    0             17203
>> has    4173.323467798065    0             16855
>> from   3810.7467373879626   0             15488
>> year   3523.1337431534193   0             14391
>> state  3480.648573280778    0             14228
>> said   3337.2482196930796   0             13676
>> it     3295.0432900944725   0             13513
>> he     3113.741967030335    0             12810
>> new    2871.0957860480994   0             11862
>>
>> Top labels for Cluster 129242 containing 12861 vectors
>> Term   LLR                  In-ClusterDF  Out-ClusterDF
>> a      22764.503256496973   0             72060
>> his    3883.2002838114277   0             17203
>> has    3798.5396822127514   0             16855
>> from   3468.6536546614952   0             15488
>> year   3206.954131908249    0             14391
>> state  3168.2954448102973   0             14228
>> said   3037.808057511691    0             13676
>> it     2999.402857856825    0             13513
>> he     2834.4202939094976   0             12810
>> new    2613.604658874683    0             11862
>>
>> Top labels for Cluster 129245 containing 6443 vectors
>> Term   LLR                  In-ClusterDF  Out-ClusterDF
>> a      10925.268199045677   0             72060
>> his    1890.511348863598    0             17203
>> has    1849.385320336558    0             16855
>> from   1689.0946326381527   0             15488
>> year   1561.8904545903206   0             14391
>> state  1543.096286157146    0             14228
>> said   1479.652662154287    0             13676
>> it     1460.9780013803393   0             13513
>> he     1380.745082413312    0             12810
>> new    1273.3357145632617   0             11862
>>
>> Top labels for Cluster 129255 containing 11390 vectors
>> Term   LLR                  In-ClusterDF  Out-ClusterDF
>> a      19957.211259535048   0             72060
>> his    3416.1555761522613   0             17203
>> has    3341.7163103362545   0             16855
>> from   3051.6410844950005   0             15488
>> year   2821.504116652999    0             14391
>> state  2787.5064550531097   0             14228
>> said   2672.7490201727487   0             13676
>> it     2638.972676954698    0             13513
>> he     2493.870809029322    0             12810
>> new    2299.653438703157    0             11862
>>
>> Top labels for Cluster 129265 containing 9461 vectors
>> Term   LLR                  In-ClusterDF  Out-ClusterDF
>> a      16362.85457371641    0             72060
>> his    2813.167819214519    0             17203
>> has    2751.908798408229    0             16855
>> from   2513.176188033074    0             15488
>> year   2323.752471229993    0             14391
>> state  2295.767774611246    0             14228
>> said   2201.3039346230216   0             13676
>> it     2173.4997256915085   0             13513
>> he     2054.0495802331716   0             12810
>> new    1894.1558320098557   0             11862
>>
>> Top labels for Cluster 129279 containing 14559 vectors
>> Term   LLR                  In-ClusterDF  Out-ClusterDF
>> a      26080.197364640888   0             72060
>> his    4430.338072712999    0             17203
>> has    4333.689091425855    0             16855
>> from   3957.116204748396    0             15488
>> year   3658.40981121175     0             14391
>> state  3614.286633652635    0             14228
>> said   3465.358771919273    0             13676
>> it     3421.527382406406    0             13513
>> he     3233.2411222746596   0             12810
>> new    2981.251407010015    0             11862
>>
>> Top labels for Cluster 129290 containing 13592 vectors
>> Term   LLR                  In-ClusterDF  Out-ClusterDF
>> a      24181.82589298836    0             72060
>> his    4117.6785482652485   0             17203
>> has    4027.8821644652635   0             16855
>> from   3677.9947950267233   0             15488
>> year   3400.440033295192    0             14391
>> state  3359.4400672735646   0             14228
>> said   3221.0516651300713   0             13676
>> it     3180.321518546436    0             13513
>> he     3005.353873868007    0             12810
>> new    2771.180380204227    0             11862
>> </snip>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
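
A side note on the "vagary of LLR" question: the log-likelihood ratio is two-sided, so an In-ClusterDF of 0 does not by itself give a score of 0. A term that is common outside a cluster and absent inside it departs strongly from independence and can still rank near the top. Below is a minimal Python sketch of Dunning's LLR for a 2x2 contingency table. The function names, the way `score_term` lays out the table, and the corpus size in the usage line are assumptions made for illustration; none of them are taken from Mahout's ClusterLabels code, and the sketch is not meant to reproduce the figures quoted above.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio (G^2) for a 2x2 table of counts.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def score_term(in_df, out_df, cluster_size, corpus_size):
    # Hypothetical helper: build the table from document frequencies.
    # Rows: document inside / outside the cluster.
    # Columns: document contains / does not contain the term.
    k11 = in_df
    k12 = cluster_size - in_df
    k21 = out_df
    k22 = corpus_size - cluster_size - out_df
    return llr(k11, k12, k21, k22)


if __name__ == "__main__":
    # corpus_size=129297 is an assumed value, not given in the thread.
    print(score_term(0, 72060, 22710, 129297))
```

With `in_df` fixed at 0, the score still depends on the cluster size and the out-of-cluster frequency, which would yield the same term order in every cluster with differently scaled scores.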
