OK, it seems the problem is that ClusterLabels was never updated when we switched over to WeightedVectorWritable. It also seems that somewhere in the KMeans run we lost the NamedVector again: the clusteredPoints directory does not contain NamedVectors, even though that is what I created the original points as.
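To illustrate why the quoted output shows the same labels for every cluster, here is a minimal sketch in Python (not Mahout code). It assumes the standard 2x2 log-likelihood ratio and feeds it an in-cluster DF of 0 for every term, as in the tables below. The function names and the out_size value of 100000 are hypothetical; the cluster sizes and out-cluster DFs are taken from the quoted output.

```python
import math

def x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    # Log-likelihood ratio for a 2x2 contingency table
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def rank_terms(cluster_size, out_size, out_df):
    # In-cluster DF is forced to 0 for every term, mimicking the buggy output
    scores = {t: llr(0, cluster_size, df, out_size - df)
              for t, df in out_df.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Out-cluster DFs from the quoted output; out_size is a made-up corpus size
out_df = {"year": 14400, "has": 16867, "from": 15499}
ranking_a = rank_terms(19186, 100000, out_df)
ranking_b = rank_terms(7770, 100000, out_df)
print(ranking_a, ranking_b)
```

With the in-cluster count stuck at 0, the score grows with the out-cluster DF alone, so both clusters rank the terms identically and only the magnitudes differ with cluster size. That is consistent with ClusterLabels not matching any points to the cluster, though the sketch does not prove where the counts are lost.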
On Mon, Jul 5, 2010 at 1:55 PM, Grant Ingersoll <[email protected]> wrote:

> Hmmm, different field, more or less the same result, i.e. all labels are
> the same for each vector [1]. I also included the Cluster dump [2]. I'm
> suspecting a bug.
>
> [1]
> Top labels for Cluster 129022 containing 19186 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      5836.5347257247195     0              16867
> from     5328.54616727354       0              15499
> year     4925.276801970322      0              14400
> state    4866.91887763422       0              14240
> new      4011.6858639516868     0              11867
> after    3882.1740732807666     0              11503
> first    3002.5827110484242     0              8998
> two      2984.1892275922        0              8945
> unit     2930.794111499563      0              8791
> one      2686.95768492762       0              8085
>
> Top labels for Cluster 129119 containing 16043 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      4808.386086146813      0              16867
> from     4390.346637147013      0              15499
> year     4058.4180186586455     0              14400
> state    4010.379176544491      0              14240
> new      3306.234930681996      0              11867
> after    3199.5810555517673     0              11503
> first    2475.079962851014      0              8998
> two      2459.926843432244      0              8945
> unit     2415.9376569474116     0              8791
> one      2215.042654468678      0              8085
>
> Top labels for Cluster 129191 containing 7770 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      2243.2657141932286     0              16867
> from     2048.755412856117      0              15499
> year     1894.2384706358425     0              14400
> state    1871.8704557279125     0              14240
> new      1543.8513879175298     0              11867
> after    1494.1429192917421     0              11503
> first    1156.303048826754      0              8998
> two      1149.2339147529565     0              8945
> unit     1128.711646862328      0              8791
> one      1034.9745452422649     0              8085
>
> Top labels for Cluster 129302 containing 9426 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      2741.316972494591      0              16867
> from     2503.501101480797      0              15499
> year     2314.5996575923637     0              14400
> state    2287.255346294027      0              14240
> new      1886.2961270781234     0              11867
> after    1825.5399498036131     0              11503
> first    1412.654560342431      0              8998
> two      1404.0158626483753     0              8945
> unit     1378.9371921028942     0              8791
> one      1264.391515379306      0              8085
>
> Top labels for Cluster 129360 containing 13092 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      3870.8181769265793     0              16867
> from     3534.623348234687      0              15499
> year     3267.633215776179      0              14400
> state    3228.989259615075      0              14240
> new      2662.4551618834957     0              11867
> after    2576.628638952039      0              11503
> first    1993.499155438505      0              8998
> two      1981.3008509986103     0              8945
> unit     1945.8889682726003     0              8791
> one      1784.1570986662991     0              8085
>
> Top labels for Cluster 129371 containing 23944 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      7455.31941217836       0              16867
> from     6805.274207816925      0              15499
> year     6289.398677708115      0              14400
> state    6214.757351316046      0              14240
> new      5121.23683049297       0              11867
> after    4955.695805796888      0              11503
> first    3831.788851835765      0              8998
> two      3808.2933898111805     0              8945
> unit     3740.0891623105854     0              8791
> one      3428.6551325367764     0              8085
>
> Top labels for Cluster 129373 containing 9885 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      2880.6778563517146     0              16867
> from     2630.736483251676      0              15499
> year     2432.208566541318      0              14400
> state    2403.4711471684277     0              14240
> new      1982.0948037123308     0              11867
> after    1918.2465800205246     0              11503
> first    1484.359997350257      0              8998
> two      1475.282112147659      0              8945
> unit     1448.9285028181039     0              8791
> one      1328.560536378529      0              8085
>
> Top labels for Cluster 129377 containing 11303 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      3314.8890487886965     0              16867
> from     3027.14497121796       0              15499
> year     2798.608615776524      0              14400
> state    2765.528720188886      0              14240
> new      2280.5166378575377     0              11867
> after    2207.0322705539875     0              11503
> first    1707.7044410486706     0              8998
> two      1697.2581536169164     0              8945
> unit     1666.932174641639      0              8791
> one      1528.4241032432765     0              8085
>
> Top labels for Cluster 129381 containing 11411 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      3348.190782570746      0              16867
> from     3057.545994592365      0              15499
> year     2826.7072093421593     0              14400
> state    2793.2941474220715     0              14240
> new      2303.4001871203072     0              11867
> after    2229.176642407663      0              11503
> first    1724.8293614634313     0              8998
> two      1714.2781240069307     0              8945
> unit     1683.6474849330261     0              8791
> one      1543.7481994605623     0              8085
>
> Top labels for Cluster 129391 containing 7334 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      2113.35227333894       0              16867
> from     1930.1305988361128     0              15499
> year     1784.577833758667      0              14400
> state    1763.5072347805835     0              14240
> new      1454.5072316131555     0              11867
> after    1407.6797917694785     0              11503
> first    1089.4127462548204     0              8998
> two      1082.7530186888762     0              8945
> unit     1063.4192575318739     0              8791
> one      975.1101242941804      0              8085
>
> [2]
> :C-129022: [0:0.001, 000:0.003, 004:0.000, 0040:0.000, 0060:0.000,
> 01:0.000, 0100:0.000, 0110:0.000,
> Top Terms:
> from => 0.022236135215980328
> u => 0.01589135359475966
> busi => 0.014789942880805335
> bank => 0.014395075820558541
> us => 0.01402954110138604
> presid => 0.01341952961319183
> month => 0.012118726267037198
> about => 0.011986047971260612
> compani => 0.011201454374207618
> obama => 0.01105482429336391
> :C-129119: [0:0.001, 00:0.000, 000:0.003, 03:0.000, 04:0.000, 05:0.000,
> 0656:0.000, 07:0.000, 09:0.00
> Top Terms:
> citi => 0.04119064757467011
> former => 0.030966538725529232
> home => 0.029642735534519644
> player => 0.02879703136878369
> soccer => 0.01847372541986708
> has => 0.015236681440174855
> mark => 0.015185164518720528
> new => 0.01266468154720074
> polic => 0.01253454821409647
> world => 0.011803315296178046
> :C-129191: [0:0.013, 00:0.002, 000:0.004, 000000:0.000, 001:0.000,
> 0011:0.000, 0022:0.000, 003:0.000,
> Top Terms:
> 4 => 0.027636996760550075
> 3 => 0.026093296145846434
> 1 => 0.02570191540464146
> 5 => 0.024807189589701305
> 2 => 0.023669513631826157
> were => 0.021134415210709086
> sunday => 0.017928504766147838
> play => 0.017243683740808733
> through => 0.017133336974828554
> game => 0.017027790192043733
> :C-129302: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 002:0.000, 008:0.000,
> 01:0.000, 011:0.000, 0112:
> Top Terms:
> new => 0.039501149799390206
> peopl => 0.01933397797740685
> world => 0.017478792605253438
> could => 0.013495142418778704
> has => 0.012987326502897916
> more => 0.012585724039194569
> from => 0.012242682917236177
> face => 0.0117046220661272
> leader => 0.011579584625370691
> presid => 0.011192085113854965
> :C-129360: [0:0.000, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000,
> 005:0.000, 007:0.000, 008:
> Top Terms:
> state => 0.044732720259456946
> unit => 0.032493582810588666
> year => 0.025651340609304542
> san => 0.025617706557963606
> after => 0.022019046306438913
> francisco => 0.020771004252363168
> california => 0.01847124801606253
> day => 0.015514125170527842
> wednesday => 0.014587851421509652
> citi => 0.012973538756014369
> :C-129371: [0:0.002, 00:0.000, 000:0.001, 01:0.000, 010:0.000, 0134:0.000,
> 016:0.000, 02:0.000, 03:0.
> Top Terms:
> game => 0.04311022785679375
> has => 0.03059922226267673
> all => 0.027605073346921877
> leagu => 0.0267627245855276
> star => 0.02206632764439995
> final => 0.020017765794918686
> season => 0.01534931562714024
> start => 0.01450896856938099
> week => 0.014407234069110549
> nation => 0.01429746391305699
> :C-129373: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 01:0.000, 012:0.000,
> 016:0.000, 03:0.000, 034:0.
> Top Terms:
> coach => 0.05209277512761816
> team => 0.031773971685165554
> charg => 0.024246280249912454
> from => 0.02093643936347752
> has => 0.02057631329905952
> week => 0.016848920922797363
> last => 0.01674320150844955
> program => 0.016023081209070564
> former => 0.015872337289314063
> after => 0.01341825692502786
> :C-129377: [0:0.002, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000,
> 006:0.000, 0065:0.000, 007
> Top Terms:
> been => 0.03757994091979662
> time => 0.03591307497544333
> first => 0.03422461795380875
> has => 0.029800513863644906
> feder => 0.027382680342986195
> monday => 0.022174840523045594
> sinc => 0.02185219249613946
> year => 0.01933420097135394
> from => 0.01162537888358458
> state => 0.009756869426688311
> :C-129381: [0:0.004, 00:0.000, 000:0.002, 00000000235:0.000, 001:0.000,
> 0011:0.000, 002:0.000, 0051:0
> Top Terms:
> win => 0.03267669747239372
> one => 0.031009191445456212
> second => 0.028066582472705007
> three => 0.026147346665631184
> out => 0.0226123748207931
> shot => 0.020446190395276405
> last => 0.019624841184867056
> night => 0.019103407305052604
> over => 0.017376642133669604
> year => 0.016475201865715022
> :C-129391: [0:0.003, 00:0.000, 000:0.001, 002:0.000, 01:0.000, 0112:0.000,
> 0123:0.000, 02:0.000, 0213
> Top Terms:
> championship => 0.035449579372280104
> run => 0.026446073370591447
> art => 0.02489330236372834
> open => 0.02282619503375418
> place => 0.022410914360311056
> grand => 0.0169734705340118
> reuter => 0.015895311339829302
> 6 => 0.015700075983436933
> continu => 0.015418929721703813
> slam => 0.012102435338420274
>
>
> -Grant
> On Jul 5, 2010, at 1:07 PM, Ted Dunning wrote:
>
> > Can't say just off-hand.
> >
> > What is the data?
> >
> > On Mon, Jul 5, 2010 at 8:20 AM, Grant Ingersoll <[email protected]> wrote:
> >
> >> I'm running ClusterLabels and it seems to be outputting the same values for
> >> every centroid [1]. When I run the cluster dumper, the top terms are fairly
> >> different for those same vectors.
> >>
> >> Have I hit a vagary of LLR or is this a bug?
> >>
> >>
> >> Thanks,
> >> Grant
> >>
> >>
> >> [1]
> >> <snip>
> >> Top labels for Cluster 129062 containing 22710 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        43269.00830466254      0              72060
> >> his      7185.503760070074      0              17203
> >> has      7028.243643655442      0              16855
> >> from     6415.739411605988      0              15488
> >> year     5930.141497239005      0              14391
> >> state    5858.43069797568       0              14228
> >> said     5616.422720833216      0              13676
> >> it       5545.207108973991      0              13513
> >> he       5239.340392438695      0              12810
> >> new      4830.124521905556      0              11862
> >>
> >> Top labels for Cluster 129145 containing 11188 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        19576.26998734614      0              72060
> >> his      3352.5135342599824     0              17203
> >> has      3279.466228939127      0              16855
> >> from     2994.8128935270943     0              15488
> >> year     2768.974903047085      0              14391
> >> state    2735.612128134351      0              14228
> >> said     2622.997358441353      0              13676
> >> it       2589.8515553446487     0              13513
> >> he       2447.4579147226177     0              12810
> >> new      2256.8640938592143     0              11862
> >>
> >> Top labels for Cluster 129201 containing 13040 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        23110.173012922285     0              72060
> >> his      3940.4691014224663     0              17203
> >> has      3854.554399965331      0              16855
> >> from     3519.784154796507      0              15488
> >> year     3254.2127395244315     0              14391
> >> state    3214.9822960514575     0              14228
> >> said     3082.565408431459      0              13676
> >> it       3043.5924300444312     0              13513
> >> he       2876.171367166564      0              12810
> >> new      2652.0934832417406     0              11862
> >>
> >> Top labels for Cluster 129211 containing 14053 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        25083.46391701023      0              72060
> >> his      4266.378291217145      0              17203
> >> has      4173.323467798065      0              16855
> >> from     3810.7467373879626     0              15488
> >> year     3523.1337431534193     0              14391
> >> state    3480.648573280778      0              14228
> >> said     3337.2482196930796     0              13676
> >> it       3295.0432900944725     0              13513
> >> he       3113.741967030335      0              12810
> >> new      2871.0957860480994     0              11862
> >>
> >> Top labels for Cluster 129242 containing 12861 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        22764.503256496973     0              72060
> >> his      3883.2002838114277     0              17203
> >> has      3798.5396822127514     0              16855
> >> from     3468.6536546614952     0              15488
> >> year     3206.954131908249      0              14391
> >> state    3168.2954448102973     0              14228
> >> said     3037.808057511691      0              13676
> >> it       2999.402857856825      0              13513
> >> he       2834.4202939094976     0              12810
> >> new      2613.604658874683      0              11862
> >>
> >> Top labels for Cluster 129245 containing 6443 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        10925.268199045677     0              72060
> >> his      1890.511348863598      0              17203
> >> has      1849.385320336558      0              16855
> >> from     1689.0946326381527     0              15488
> >> year     1561.8904545903206     0              14391
> >> state    1543.096286157146      0              14228
> >> said     1479.652662154287      0              13676
> >> it       1460.9780013803393     0              13513
> >> he       1380.745082413312      0              12810
> >> new      1273.3357145632617     0              11862
> >>
> >> Top labels for Cluster 129255 containing 11390 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        19957.211259535048     0              72060
> >> his      3416.1555761522613     0              17203
> >> has      3341.7163103362545     0              16855
> >> from     3051.6410844950005     0              15488
> >> year     2821.504116652999      0              14391
> >> state    2787.5064550531097     0              14228
> >> said     2672.7490201727487     0              13676
> >> it       2638.972676954698      0              13513
> >> he       2493.870809029322      0              12810
> >> new      2299.653438703157      0              11862
> >>
> >> Top labels for Cluster 129265 containing 9461 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        16362.85457371641      0              72060
> >> his      2813.167819214519      0              17203
> >> has      2751.908798408229      0              16855
> >> from     2513.176188033074      0              15488
> >> year     2323.752471229993      0              14391
> >> state    2295.767774611246      0              14228
> >> said     2201.3039346230216     0              13676
> >> it       2173.4997256915085     0              13513
> >> he       2054.0495802331716     0              12810
> >> new      1894.1558320098557     0              11862
> >>
> >> Top labels for Cluster 129279 containing 14559 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        26080.197364640888     0              72060
> >> his      4430.338072712999      0              17203
> >> has      4333.689091425855      0              16855
> >> from     3957.116204748396      0              15488
> >> year     3658.40981121175       0              14391
> >> state    3614.286633652635      0              14228
> >> said     3465.358771919273      0              13676
> >> it       3421.527382406406      0              13513
> >> he       3233.2411222746596     0              12810
> >> new      2981.251407010015      0              11862
> >>
> >> Top labels for Cluster 129290 containing 13592 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        24181.82589298836      0              72060
> >> his      4117.6785482652485     0              17203
> >> has      4027.8821644652635     0              16855
> >> from     3677.9947950267233     0              15488
> >> year     3400.440033295192      0              14391
> >> state    3359.4400672735646     0              14228
> >> said     3221.0516651300713     0              13676
> >> it       3180.321518546436      0              13513
> >> he       3005.353873868007      0              12810
> >> new      2771.180380204227      0              11862
> >> </snip>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
