Hmm, I tried a different field and got more or less the same result: every cluster gets the same labels [1], and In-ClusterDF is 0 for every term. I've also included the cluster dump [2], where the top terms do differ from cluster to cluster. I suspect a bug.
[1]
Top labels for Cluster 129022 containing 19186 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 5836.5347257247195 0 16867
from 5328.54616727354 0 15499
year 4925.276801970322 0 14400
state 4866.91887763422 0 14240
new 4011.6858639516868 0 11867
after 3882.1740732807666 0 11503
first 3002.5827110484242 0 8998
two 2984.1892275922 0 8945
unit 2930.794111499563 0 8791
one 2686.95768492762 0 8085
Top labels for Cluster 129119 containing 16043 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 4808.386086146813 0 16867
from 4390.346637147013 0 15499
year 4058.4180186586455 0 14400
state 4010.379176544491 0 14240
new 3306.234930681996 0 11867
after 3199.5810555517673 0 11503
first 2475.079962851014 0 8998
two 2459.926843432244 0 8945
unit 2415.9376569474116 0 8791
one 2215.042654468678 0 8085
Top labels for Cluster 129191 containing 7770 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 2243.2657141932286 0 16867
from 2048.755412856117 0 15499
year 1894.2384706358425 0 14400
state 1871.8704557279125 0 14240
new 1543.8513879175298 0 11867
after 1494.1429192917421 0 11503
first 1156.303048826754 0 8998
two 1149.2339147529565 0 8945
unit 1128.711646862328 0 8791
one 1034.9745452422649 0 8085
Top labels for Cluster 129302 containing 9426 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 2741.316972494591 0 16867
from 2503.501101480797 0 15499
year 2314.5996575923637 0 14400
state 2287.255346294027 0 14240
new 1886.2961270781234 0 11867
after 1825.5399498036131 0 11503
first 1412.654560342431 0 8998
two 1404.0158626483753 0 8945
unit 1378.9371921028942 0 8791
one 1264.391515379306 0 8085
Top labels for Cluster 129360 containing 13092 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 3870.8181769265793 0 16867
from 3534.623348234687 0 15499
year 3267.633215776179 0 14400
state 3228.989259615075 0 14240
new 2662.4551618834957 0 11867
after 2576.628638952039 0 11503
first 1993.499155438505 0 8998
two 1981.3008509986103 0 8945
unit 1945.8889682726003 0 8791
one 1784.1570986662991 0 8085
Top labels for Cluster 129371 containing 23944 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 7455.31941217836 0 16867
from 6805.274207816925 0 15499
year 6289.398677708115 0 14400
state 6214.757351316046 0 14240
new 5121.23683049297 0 11867
after 4955.695805796888 0 11503
first 3831.788851835765 0 8998
two 3808.2933898111805 0 8945
unit 3740.0891623105854 0 8791
one 3428.6551325367764 0 8085
Top labels for Cluster 129373 containing 9885 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 2880.6778563517146 0 16867
from 2630.736483251676 0 15499
year 2432.208566541318 0 14400
state 2403.4711471684277 0 14240
new 1982.0948037123308 0 11867
after 1918.2465800205246 0 11503
first 1484.359997350257 0 8998
two 1475.282112147659 0 8945
unit 1448.9285028181039 0 8791
one 1328.560536378529 0 8085
Top labels for Cluster 129377 containing 11303 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 3314.8890487886965 0 16867
from 3027.14497121796 0 15499
year 2798.608615776524 0 14400
state 2765.528720188886 0 14240
new 2280.5166378575377 0 11867
after 2207.0322705539875 0 11503
first 1707.7044410486706 0 8998
two 1697.2581536169164 0 8945
unit 1666.932174641639 0 8791
one 1528.4241032432765 0 8085
Top labels for Cluster 129381 containing 11411 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 3348.190782570746 0 16867
from 3057.545994592365 0 15499
year 2826.7072093421593 0 14400
state 2793.2941474220715 0 14240
new 2303.4001871203072 0 11867
after 2229.176642407663 0 11503
first 1724.8293614634313 0 8998
two 1714.2781240069307 0 8945
unit 1683.6474849330261 0 8791
one 1543.7481994605623 0 8085
Top labels for Cluster 129391 containing 7334 vectors
Term LLR In-ClusterDF Out-ClusterDF
has 2113.35227333894 0 16867
from 1930.1305988361128 0 15499
year 1784.577833758667 0 14400
state 1763.5072347805835 0 14240
new 1454.5072316131555 0 11867
after 1407.6797917694785 0 11503
first 1089.4127462548204 0 8998
two 1082.7530186888762 0 8945
unit 1063.4192575318739 0 8791
one 975.1101242941804 0 8085
[2]
:C-129022: [0:0.001, 000:0.003, 004:0.000, 0040:0.000, 0060:0.000, 01:0.000,
0100:0.000, 0110:0.000,
Top Terms:
from =>0.022236135215980328
u => 0.01589135359475966
busi =>0.014789942880805335
bank =>0.014395075820558541
us => 0.01402954110138604
presid => 0.01341952961319183
month =>0.012118726267037198
about =>0.011986047971260612
compani =>0.011201454374207618
obama => 0.01105482429336391
:C-129119: [0:0.001, 00:0.000, 000:0.003, 03:0.000, 04:0.000, 05:0.000,
0656:0.000, 07:0.000, 09:0.00
Top Terms:
citi => 0.04119064757467011
former =>0.030966538725529232
home =>0.029642735534519644
player => 0.02879703136878369
soccer => 0.01847372541986708
has =>0.015236681440174855
mark =>0.015185164518720528
new => 0.01266468154720074
polic => 0.01253454821409647
world =>0.011803315296178046
:C-129191: [0:0.013, 00:0.002, 000:0.004, 000000:0.000, 001:0.000, 0011:0.000,
0022:0.000, 003:0.000,
Top Terms:
4 =>0.027636996760550075
3 =>0.026093296145846434
1 => 0.02570191540464146
5 =>0.024807189589701305
2 =>0.023669513631826157
were =>0.021134415210709086
sunday =>0.017928504766147838
play =>0.017243683740808733
through =>0.017133336974828554
game =>0.017027790192043733
:C-129302: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 002:0.000, 008:0.000,
01:0.000, 011:0.000, 0112:
Top Terms:
new =>0.039501149799390206
peopl => 0.01933397797740685
world =>0.017478792605253438
could =>0.013495142418778704
has =>0.012987326502897916
more =>0.012585724039194569
from =>0.012242682917236177
face => 0.0117046220661272
leader =>0.011579584625370691
presid =>0.011192085113854965
:C-129360: [0:0.000, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000,
005:0.000, 007:0.000, 008:
Top Terms:
state =>0.044732720259456946
unit =>0.032493582810588666
year =>0.025651340609304542
san =>0.025617706557963606
after =>0.022019046306438913
francisco =>0.020771004252363168
california => 0.01847124801606253
day =>0.015514125170527842
wednesday =>0.014587851421509652
citi =>0.012973538756014369
:C-129371: [0:0.002, 00:0.000, 000:0.001, 01:0.000, 010:0.000, 0134:0.000,
016:0.000, 02:0.000, 03:0.
Top Terms:
game => 0.04311022785679375
has => 0.03059922226267673
all =>0.027605073346921877
leagu => 0.0267627245855276
star => 0.02206632764439995
final =>0.020017765794918686
season => 0.01534931562714024
start => 0.01450896856938099
week =>0.014407234069110549
nation => 0.01429746391305699
:C-129373: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 01:0.000, 012:0.000,
016:0.000, 03:0.000, 034:0.
Top Terms:
coach => 0.05209277512761816
team =>0.031773971685165554
charg =>0.024246280249912454
from => 0.02093643936347752
has => 0.02057631329905952
week =>0.016848920922797363
last => 0.01674320150844955
program =>0.016023081209070564
former =>0.015872337289314063
after => 0.01341825692502786
:C-129377: [0:0.002, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000,
006:0.000, 0065:0.000, 007
Top Terms:
been => 0.03757994091979662
time => 0.03591307497544333
first => 0.03422461795380875
has =>0.029800513863644906
feder =>0.027382680342986195
monday =>0.022174840523045594
sinc => 0.02185219249613946
year => 0.01933420097135394
from => 0.01162537888358458
state =>0.009756869426688311
:C-129381: [0:0.004, 00:0.000, 000:0.002, 00000000235:0.000, 001:0.000,
0011:0.000, 002:0.000, 0051:0
Top Terms:
win => 0.03267669747239372
one =>0.031009191445456212
second =>0.028066582472705007
three =>0.026147346665631184
out => 0.0226123748207931
shot =>0.020446190395276405
last =>0.019624841184867056
night =>0.019103407305052604
over =>0.017376642133669604
year =>0.016475201865715022
:C-129391: [0:0.003, 00:0.000, 000:0.001, 002:0.000, 01:0.000, 0112:0.000,
0123:0.000, 02:0.000, 0213
Top Terms:
championship =>0.035449579372280104
run =>0.026446073370591447
art => 0.02489330236372834
open => 0.02282619503375418
place =>0.022410914360311056
grand => 0.0169734705340118
reuter =>0.015895311339829302
6 =>0.015700075983436933
continu =>0.015418929721703813
slam =>0.012102435338420274
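
For what it's worth, here is a minimal sketch of why an In-ClusterDF of 0 would produce the same labels for every cluster. It assumes the score is Dunning's log-likelihood ratio over a 2x2 table of (in cluster / out of cluster) x (has term / lacks term); that is a guess at what ClusterLabels computes, not something checked against the code. The cluster sizes and Out-ClusterDF values are copied from [1]. CORPUS_SIZE is made up, and the function names are only illustrative.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is taken to be 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy: N log N - sum(k log k), with N = sum(counts).
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio (G^2) for a 2x2 contingency table.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


CORPUS_SIZE = 130_000  # hypothetical: the real corpus size is not in the output

# Out-ClusterDF per term, copied from [1].
OUT_DF = {"has": 16867, "from": 15499, "year": 14400, "state": 14240, "new": 11867}

# Cluster id -> number of vectors, copied from [1].
CLUSTER_SIZES = {129022: 19186, 129191: 7770, 129371: 23944}


def rank_terms(cluster_size, in_df=0):
    # in_df=0 mirrors the In-ClusterDF column in [1].
    scores = {}
    for term, out_df in OUT_DF.items():
        k11 = in_df                                   # in cluster, has term
        k12 = cluster_size - in_df                    # in cluster, lacks term
        k21 = out_df                                  # out of cluster, has term
        k22 = CORPUS_SIZE - cluster_size - out_df     # out of cluster, lacks term
        scores[term] = llr(k11, k12, k21, k22)
    return sorted(scores, key=scores.get, reverse=True)


rankings = {cid: rank_terms(size) for cid, size in CLUSTER_SIZES.items()}
for cid, terms in rankings.items():
    print(cid, terms)
```

With the in-cluster count fixed at 0, the score grows with Out-ClusterDF for any cluster size, so every cluster ends up ranked by corpus-wide frequency. Under this assumption, identical labels would follow from the zero In-ClusterDF itself rather than from anything peculiar to LLR.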
-Grant
On Jul 5, 2010, at 1:07 PM, Ted Dunning wrote:
> Can't say just off-hand.
>
> What is the data?
>
> On Mon, Jul 5, 2010 at 8:20 AM, Grant Ingersoll wrote:
>
>> I'm running ClusterLabels and it seems to be outputting the same values for
>> every centroid [1]. When I run the cluster dumper, the top terms are fairly
>> different for those same vectors.
>>
>> Have I hit a vagary of LLR or is this a bug?
>>
>>
>> Thanks,
>> Grant
>>
>>
>> [1]
>> <snip>
>> Top labels for Cluster 129062 containing 22710 vectors
>> Term LLR In-ClusterDF Out-ClusterDF
>> a 43269.00830466254 0 72060
>> his 7185.503760070074 0 17203
>> has 7028.243643655442 0 16855
>> from 6415.739411605988 0 15488
>> year 5930.141497239005 0 14391
>> state 5858.43069797568 0 14228
>> said 5616.422720833216 0 13676
>> it 5545.207108973991 0 13513
>> he 5239.340392438695 0 12810
>> new 4830.124521905556 0 11862
>>
>> Top labels for Cluster 129145 containing 11188 vectors
>> Term LLR In-ClusterDF Out-ClusterDF
>> a 19576.26998734614 0 72060
>> his 3352.5135342599824 0 17203
>> has 3279.466228939127 0 16855
>> from 2994.8128935270943 0 15488
>> year 2768.974903047085 0 14391
>> state 2735.612128134351 0 14228
>> said 2622.997358441353 0 13676
>> it 2589.8515553446487 0 13513
>> he 2447.4579147226177 0 12810
>> new 2256.8640938592143 0 11862
>>
>> Top labels for Cluster 129201 containing 13040 vectors
>> Term LLR In-ClusterDF Out-ClusterDF
>> a 23110.173012922285 0 72060
>> his 3940.4691014224663 0 17203
>> has 3854.554399965331 0 16855
>> from 3519.784154796507 0 15488
>> year 3254.2127395244315 0 14391
>> state 3214.9822960514575 0 14228
>> said 3082.565408431459 0 13676
>> it 3043.5924300444312 0 13513
>> he 2876.171367166564 0 12810
>> new 2652.0934832417406 0 11862
>>
>> Top labels for Cluster 129211 containing 14053 vectors
>> Term LLR In-ClusterDF Out-ClusterDF
>> a 25083.46391701023 0 72060
>> his 4266.378291217145 0 17203
>> has 4173.323467798065 0 16855
>> from 3810.7467373879626 0 15488
>> year 3523.1337431534193 0 14391
>> state 3480.648573280778 0 14228
>> said 3337.2482196930796 0 13676
>> it 3295.0432900944725 0 13513
>> he 3113.741967030335 0 12810
>> new 2871.0957860480994 0 11862
>>
>> Top labels for Cluster 129242 containing 12861 vectors
>> Term LLR In-ClusterDF Out-ClusterDF
>> a 22764.503256496973 0 72060
>> his 3883.2002838114277 0 17203
>> has 3798.5396822127514 0 16855
>> from 3468.6536546614952 0 15488
>> year 3206.954131908249 0 14391
>> state 3168.2954448102973 0 14228
>> said 3037.808057511691 0 13676
>> it 2999.402857856825 0 13513
>> he 2834.4202939094976 0 12810
>> new 2613.604658874683 0 11862
>>
>> Top labels for Cluster 129245 containing 6443 vectors
>> Term LLR In-ClusterDF Out-ClusterDF
>> a 10925.268199045677 0 72060
>> his 1890.511348863598 0 17203
>> has 1849.385320336558 0 16855
>> from 1689.0946326381527 0 15488
>> year 1561.8904545903206 0 14391
>> state 1543.096286157146 0 14228
>> said 1479.652662154287 0 13676
>> it 1460.9780013803393 0 13513
>> he 1380.745082413312 0 12810
>> new 1273.3357145632617 0 11862
>>
>> Top labels for Cluster 129255 containing 11390 vectors
>> Term LLR In-ClusterDF Out-ClusterDF
>> a 19957.211259535048 0 72060
>> his 3416.1555761522613 0 17203
>> has 3341.7163103362545 0 16855
>> from 3051.6410844950005 0 15488
>> year 2821.504116652999 0 14391
>> state 2787.5064550531097 0 14228
>> said 2672.7490201727487 0 13676
>> it 2638.972676954698 0 13513
>> he 2493.870809029322 0 12810
>> new 2299.653438703157 0 11862
>>
>> Top labels for Cluster 129265 containing 9461 vectors
>> Term LLR In-ClusterDF Out-ClusterDF
>> a 16362.85457371641 0 72060
>> his 2813.167819214519 0 17203
>> has 2751.908798408229 0 16855
>> from 2513.176188033074 0 15488
>> year 2323.752471229993 0 14391
>> state 2295.767774611246 0 14228
>> said 2201.3039346230216 0 13676
>> it 2173.4997256915085 0 13513
>> he 2054.0495802331716 0 12810
>> new 1894.1558320098557 0 11862
>>
>> Top labels for Cluster 129279 containing 14559 vectors
>> Term LLR In-ClusterDF Out-ClusterDF
>> a 26080.197364640888 0 72060
>> his 4430.338072712999 0 17203
>> has 4333.689091425855 0 16855
>> from 3957.116204748396 0 15488
>> year 3658.40981121175 0 14391
>> state 3614.286633652635 0 14228
>> said 3465.358771919273 0 13676
>> it 3421.527382406406 0 13513
>> he 3233.2411222746596 0 12810
>> new 2981.251407010015 0 11862
>>
>> Top labels for Cluster 129290 containing 13592 vectors
>> Term LLR In-ClusterDF Out-ClusterDF
>> a 24181.82589298836 0 72060
>> his 4117.6785482652485 0 17203
>> has 4027.8821644652635 0 16855
>> from 3677.9947950267233 0 15488
>> year 3400.440033295192 0 14391
>> state 3359.4400672735646 0 14228
>> said 3221.0516651300713 0 13676
>> it 3180.321518546436 0 13513
>> he 3005.353873868007 0 12810
>> new 2771.180380204227 0 11862
>> </snip>
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
