MAHOUT-434 solves the problem.
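
On the "vagary of LLR or bug" question in the quoted thread below: a rough sketch, not the ClusterLabels code itself, of why an In-ClusterDF that is stuck at 0 produces the same label list for every cluster. It uses the standard log-likelihood ratio (G^2) on a 2x2 contingency table. The table layout, the function names, and the corpus size of 130000 are my assumptions for illustration; only the Out-ClusterDF values and the two cluster sizes are taken from the output below.

```python
import math


def xlogx(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts.
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio (G^2) for a 2x2 contingency table.
    # Assumed layout (hypothetical, chosen for this sketch):
    #   k11 = docs in the cluster containing the term (In-ClusterDF)
    #   k12 = docs outside the cluster containing the term (Out-ClusterDF)
    #   k21 = docs in the cluster without the term
    #   k22 = docs outside the cluster without the term
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def rank_terms(cluster_size, out_df, corpus_size):
    # Rank terms by LLR with In-ClusterDF forced to 0, mimicking the symptom.
    scores = {
        term: llr(0, d, cluster_size, corpus_size - cluster_size - d)
        for term, d in out_df.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Out-ClusterDF values copied from the quoted output.
OUT_DF = {"has": 16867, "from": 15499, "year": 14400, "state": 14240, "new": 11867}
CORPUS_SIZE = 130000  # hypothetical; the real corpus size is not given in the thread

ranking_a = rank_terms(19186, OUT_DF, CORPUS_SIZE)  # size of Cluster 129022
ranking_b = rank_terms(7334, OUT_DF, CORPUS_SIZE)   # size of Cluster 129391
print(ranking_a == ranking_b)
```

With the in-cluster count at 0, the score in this sketch depends only on the cluster size and the out-of-cluster count, so the terms sort by Out-ClusterDF regardless of which cluster is being labeled. That matches the identical lists below and points at the missing in-cluster counts rather than at LLR.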

On Jul 5, 2010, at 2:34 PM, Grant Ingersoll wrote:

> https://issues.apache.org/jira/browse/MAHOUT-433
>
> On Mon, Jul 5, 2010 at 2:28 PM, Grant Ingersoll <[email protected]> wrote:
> OK, seems the problem is ClusterLabels was never updated when we switched
> over to WeightedVectorWritable and it also seems like somewhere in the
> equation of KMeans being run that we lost the NamedVector again, as the
> clusteredPoints directory does not contain NamedVectors, even though that is
> what I created the original points as when starting.
>
>
> On Mon, Jul 5, 2010 at 1:55 PM, Grant Ingersoll <[email protected]> wrote:
> Hmmm, different field, more or less the same result, i.e. all labels are the
> same for each vector [1]. I also included the Cluster dump [2]. I'm
> suspecting a bug.
>
> [1]
> Top labels for Cluster 129022 containing 19186 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      5836.5347257247195     0              16867
> from     5328.54616727354       0              15499
> year     4925.276801970322      0              14400
> state    4866.91887763422       0              14240
> new      4011.6858639516868     0              11867
> after    3882.1740732807666     0              11503
> first    3002.5827110484242     0              8998
> two      2984.1892275922        0              8945
> unit     2930.794111499563      0              8791
> one      2686.95768492762       0              8085
>
> Top labels for Cluster 129119 containing 16043 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      4808.386086146813      0              16867
> from     4390.346637147013      0              15499
> year     4058.4180186586455     0              14400
> state    4010.379176544491      0              14240
> new      3306.234930681996      0              11867
> after    3199.5810555517673     0              11503
> first    2475.079962851014      0              8998
> two      2459.926843432244      0              8945
> unit     2415.9376569474116     0              8791
> one      2215.042654468678      0              8085
>
> Top labels for Cluster 129191 containing 7770 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      2243.2657141932286     0              16867
> from     2048.755412856117      0              15499
> year     1894.2384706358425     0              14400
> state    1871.8704557279125     0              14240
> new      1543.8513879175298     0              11867
> after    1494.1429192917421     0              11503
> first    1156.303048826754      0              8998
> two      1149.2339147529565     0              8945
> unit     1128.711646862328      0              8791
> one      1034.9745452422649     0              8085
>
> Top labels for Cluster 129302 containing 9426 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      2741.316972494591      0              16867
> from     2503.501101480797      0              15499
> year     2314.5996575923637     0              14400
> state    2287.255346294027      0              14240
> new      1886.2961270781234     0              11867
> after    1825.5399498036131     0              11503
> first    1412.654560342431      0              8998
> two      1404.0158626483753     0              8945
> unit     1378.9371921028942     0              8791
> one      1264.391515379306      0              8085
>
> Top labels for Cluster 129360 containing 13092 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      3870.8181769265793     0              16867
> from     3534.623348234687      0              15499
> year     3267.633215776179      0              14400
> state    3228.989259615075      0              14240
> new      2662.4551618834957     0              11867
> after    2576.628638952039      0              11503
> first    1993.499155438505      0              8998
> two      1981.3008509986103     0              8945
> unit     1945.8889682726003     0              8791
> one      1784.1570986662991     0              8085
>
> Top labels for Cluster 129371 containing 23944 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      7455.31941217836       0              16867
> from     6805.274207816925      0              15499
> year     6289.398677708115      0              14400
> state    6214.757351316046      0              14240
> new      5121.23683049297       0              11867
> after    4955.695805796888      0              11503
> first    3831.788851835765      0              8998
> two      3808.2933898111805     0              8945
> unit     3740.0891623105854     0              8791
> one      3428.6551325367764     0              8085
>
> Top labels for Cluster 129373 containing 9885 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      2880.6778563517146     0              16867
> from     2630.736483251676      0              15499
> year     2432.208566541318      0              14400
> state    2403.4711471684277     0              14240
> new      1982.0948037123308     0              11867
> after    1918.2465800205246     0              11503
> first    1484.359997350257      0              8998
> two      1475.282112147659      0              8945
> unit     1448.9285028181039     0              8791
> one      1328.560536378529      0              8085
>
> Top labels for Cluster 129377 containing 11303 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      3314.8890487886965     0              16867
> from     3027.14497121796       0              15499
> year     2798.608615776524      0              14400
> state    2765.528720188886      0              14240
> new      2280.5166378575377     0              11867
> after    2207.0322705539875     0              11503
> first    1707.7044410486706     0              8998
> two      1697.2581536169164     0              8945
> unit     1666.932174641639      0              8791
> one      1528.4241032432765     0              8085
>
> Top labels for Cluster 129381 containing 11411 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      3348.190782570746      0              16867
> from     3057.545994592365      0              15499
> year     2826.7072093421593     0              14400
> state    2793.2941474220715     0              14240
> new      2303.4001871203072     0              11867
> after    2229.176642407663      0              11503
> first    1724.8293614634313     0              8998
> two      1714.2781240069307     0              8945
> unit     1683.6474849330261     0              8791
> one      1543.7481994605623     0              8085
>
> Top labels for Cluster 129391 containing 7334 vectors
> Term     LLR                    In-ClusterDF   Out-ClusterDF
> has      2113.35227333894       0              16867
> from     1930.1305988361128     0              15499
> year     1784.577833758667      0              14400
> state    1763.5072347805835     0              14240
> new      1454.5072316131555     0              11867
> after    1407.6797917694785     0              11503
> first    1089.4127462548204     0              8998
> two      1082.7530186888762     0              8945
> unit     1063.4192575318739     0              8791
> one      975.1101242941804      0              8085
>
> [2]
> :C-129022: [0:0.001, 000:0.003, 004:0.000, 0040:0.000, 0060:0.000, 01:0.000,
> 0100:0.000, 0110:0.000,
> Top Terms:
>     from         => 0.022236135215980328
>     u            => 0.01589135359475966
>     busi         => 0.014789942880805335
>     bank         => 0.014395075820558541
>     us           => 0.01402954110138604
>     presid       => 0.01341952961319183
>     month        => 0.012118726267037198
>     about        => 0.011986047971260612
>     compani      => 0.011201454374207618
>     obama        => 0.01105482429336391
> :C-129119: [0:0.001, 00:0.000, 000:0.003, 03:0.000, 04:0.000, 05:0.000,
> 0656:0.000, 07:0.000, 09:0.00
> Top Terms:
>     citi         => 0.04119064757467011
>     former       => 0.030966538725529232
>     home         => 0.029642735534519644
>     player       => 0.02879703136878369
>     soccer       => 0.01847372541986708
>     has          => 0.015236681440174855
>     mark         => 0.015185164518720528
>     new          => 0.01266468154720074
>     polic        => 0.01253454821409647
>     world        => 0.011803315296178046
> :C-129191: [0:0.013, 00:0.002, 000:0.004, 000000:0.000, 001:0.000,
> 0011:0.000, 0022:0.000, 003:0.000,
> Top Terms:
>     4            => 0.027636996760550075
>     3            => 0.026093296145846434
>     1            => 0.02570191540464146
>     5            => 0.024807189589701305
>     2            => 0.023669513631826157
>     were         => 0.021134415210709086
>     sunday       => 0.017928504766147838
>     play         => 0.017243683740808733
>     through      => 0.017133336974828554
>     game         => 0.017027790192043733
> :C-129302: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 002:0.000, 008:0.000,
> 01:0.000, 011:0.000, 0112:
> Top Terms:
>     new          => 0.039501149799390206
>     peopl        => 0.01933397797740685
>     world        => 0.017478792605253438
>     could        => 0.013495142418778704
>     has          => 0.012987326502897916
>     more         => 0.012585724039194569
>     from         => 0.012242682917236177
>     face         => 0.0117046220661272
>     leader       => 0.011579584625370691
>     presid       => 0.011192085113854965
> :C-129360: [0:0.000, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000,
> 005:0.000, 007:0.000, 008:
> Top Terms:
>     state        => 0.044732720259456946
>     unit         => 0.032493582810588666
>     year         => 0.025651340609304542
>     san          => 0.025617706557963606
>     after        => 0.022019046306438913
>     francisco    => 0.020771004252363168
>     california   => 0.01847124801606253
>     day          => 0.015514125170527842
>     wednesday    => 0.014587851421509652
>     citi         => 0.012973538756014369
> :C-129371: [0:0.002, 00:0.000, 000:0.001, 01:0.000, 010:0.000, 0134:0.000,
> 016:0.000, 02:0.000, 03:0.
> Top Terms:
>     game         => 0.04311022785679375
>     has          => 0.03059922226267673
>     all          => 0.027605073346921877
>     leagu        => 0.0267627245855276
>     star         => 0.02206632764439995
>     final        => 0.020017765794918686
>     season       => 0.01534931562714024
>     start        => 0.01450896856938099
>     week         => 0.014407234069110549
>     nation       => 0.01429746391305699
> :C-129373: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 01:0.000, 012:0.000,
> 016:0.000, 03:0.000, 034:0.
> Top Terms:
>     coach        => 0.05209277512761816
>     team         => 0.031773971685165554
>     charg        => 0.024246280249912454
>     from         => 0.02093643936347752
>     has          => 0.02057631329905952
>     week         => 0.016848920922797363
>     last         => 0.01674320150844955
>     program      => 0.016023081209070564
>     former       => 0.015872337289314063
>     after        => 0.01341825692502786
> :C-129377: [0:0.002, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000,
> 006:0.000, 0065:0.000, 007
> Top Terms:
>     been         => 0.03757994091979662
>     time         => 0.03591307497544333
>     first        => 0.03422461795380875
>     has          => 0.029800513863644906
>     feder        => 0.027382680342986195
>     monday       => 0.022174840523045594
>     sinc         => 0.02185219249613946
>     year         => 0.01933420097135394
>     from         => 0.01162537888358458
>     state        => 0.009756869426688311
> :C-129381: [0:0.004, 00:0.000, 000:0.002, 00000000235:0.000, 001:0.000,
> 0011:0.000, 002:0.000, 0051:0
> Top Terms:
>     win          => 0.03267669747239372
>     one          => 0.031009191445456212
>     second       => 0.028066582472705007
>     three        => 0.026147346665631184
>     out          => 0.0226123748207931
>     shot         => 0.020446190395276405
>     last         => 0.019624841184867056
>     night        => 0.019103407305052604
>     over         => 0.017376642133669604
>     year         => 0.016475201865715022
> :C-129391: [0:0.003, 00:0.000, 000:0.001, 002:0.000, 01:0.000, 0112:0.000,
> 0123:0.000, 02:0.000, 0213
> Top Terms:
>     championship => 0.035449579372280104
>     run          => 0.026446073370591447
>     art          => 0.02489330236372834
>     open         => 0.02282619503375418
>     place        => 0.022410914360311056
>     grand        => 0.0169734705340118
>     reuter       => 0.015895311339829302
>     6            => 0.015700075983436933
>     continu      => 0.015418929721703813
>     slam         => 0.012102435338420274
>
>
> -Grant
> On Jul 5, 2010, at 1:07 PM, Ted Dunning wrote:
>
> > Can't say just off-hand.
> >
> > What is the data?
> >
> > On Mon, Jul 5, 2010 at 8:20 AM, Grant Ingersoll <[email protected]> wrote:
> >
> >> I'm running ClusterLabels and it seems to be outputting the same values for
> >> every centroid [1]. When I run the cluster dumper, the top terms are fairly
> >> different for those same vectors.
> >>
> >> Have I hit a vagary of LLR or is this a bug?
> >>
> >>
> >> Thanks,
> >> Grant
> >>
> >>
> >> [1]
> >> <snip>
> >> Top labels for Cluster 129062 containing 22710 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        43269.00830466254      0              72060
> >> his      7185.503760070074      0              17203
> >> has      7028.243643655442      0              16855
> >> from     6415.739411605988      0              15488
> >> year     5930.141497239005      0              14391
> >> state    5858.43069797568       0              14228
> >> said     5616.422720833216      0              13676
> >> it       5545.207108973991      0              13513
> >> he       5239.340392438695      0              12810
> >> new      4830.124521905556      0              11862
> >>
> >> Top labels for Cluster 129145 containing 11188 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        19576.26998734614      0              72060
> >> his      3352.5135342599824     0              17203
> >> has      3279.466228939127      0              16855
> >> from     2994.8128935270943     0              15488
> >> year     2768.974903047085      0              14391
> >> state    2735.612128134351      0              14228
> >> said     2622.997358441353      0              13676
> >> it       2589.8515553446487     0              13513
> >> he       2447.4579147226177     0              12810
> >> new      2256.8640938592143     0              11862
> >>
> >> Top labels for Cluster 129201 containing 13040 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        23110.173012922285     0              72060
> >> his      3940.4691014224663     0              17203
> >> has      3854.554399965331      0              16855
> >> from     3519.784154796507      0              15488
> >> year     3254.2127395244315     0              14391
> >> state    3214.9822960514575     0              14228
> >> said     3082.565408431459      0              13676
> >> it       3043.5924300444312     0              13513
> >> he       2876.171367166564      0              12810
> >> new      2652.0934832417406     0              11862
> >>
> >> Top labels for Cluster 129211 containing 14053 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        25083.46391701023      0              72060
> >> his      4266.378291217145      0              17203
> >> has      4173.323467798065      0              16855
> >> from     3810.7467373879626     0              15488
> >> year     3523.1337431534193     0              14391
> >> state    3480.648573280778      0              14228
> >> said     3337.2482196930796     0              13676
> >> it       3295.0432900944725     0              13513
> >> he       3113.741967030335      0              12810
> >> new      2871.0957860480994     0              11862
> >>
> >> Top labels for Cluster 129242 containing 12861 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        22764.503256496973     0              72060
> >> his      3883.2002838114277     0              17203
> >> has      3798.5396822127514     0              16855
> >> from     3468.6536546614952     0              15488
> >> year     3206.954131908249      0              14391
> >> state    3168.2954448102973     0              14228
> >> said     3037.808057511691      0              13676
> >> it       2999.402857856825      0              13513
> >> he       2834.4202939094976     0              12810
> >> new      2613.604658874683      0              11862
> >>
> >> Top labels for Cluster 129245 containing 6443 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        10925.268199045677     0              72060
> >> his      1890.511348863598      0              17203
> >> has      1849.385320336558      0              16855
> >> from     1689.0946326381527     0              15488
> >> year     1561.8904545903206     0              14391
> >> state    1543.096286157146      0              14228
> >> said     1479.652662154287      0              13676
> >> it       1460.9780013803393     0              13513
> >> he       1380.745082413312      0              12810
> >> new      1273.3357145632617     0              11862
> >>
> >> Top labels for Cluster 129255 containing 11390 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        19957.211259535048     0              72060
> >> his      3416.1555761522613     0              17203
> >> has      3341.7163103362545     0              16855
> >> from     3051.6410844950005     0              15488
> >> year     2821.504116652999      0              14391
> >> state    2787.5064550531097     0              14228
> >> said     2672.7490201727487     0              13676
> >> it       2638.972676954698      0              13513
> >> he       2493.870809029322      0              12810
> >> new      2299.653438703157      0              11862
> >>
> >> Top labels for Cluster 129265 containing 9461 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        16362.85457371641      0              72060
> >> his      2813.167819214519      0              17203
> >> has      2751.908798408229      0              16855
> >> from     2513.176188033074      0              15488
> >> year     2323.752471229993      0              14391
> >> state    2295.767774611246      0              14228
> >> said     2201.3039346230216     0              13676
> >> it       2173.4997256915085     0              13513
> >> he       2054.0495802331716     0              12810
> >> new      1894.1558320098557     0              11862
> >>
> >> Top labels for Cluster 129279 containing 14559 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        26080.197364640888     0              72060
> >> his      4430.338072712999      0              17203
> >> has      4333.689091425855      0              16855
> >> from     3957.116204748396      0              15488
> >> year     3658.40981121175       0              14391
> >> state    3614.286633652635      0              14228
> >> said     3465.358771919273      0              13676
> >> it       3421.527382406406      0              13513
> >> he       3233.2411222746596     0              12810
> >> new      2981.251407010015      0              11862
> >>
> >> Top labels for Cluster 129290 containing 13592 vectors
> >> Term     LLR                    In-ClusterDF   Out-ClusterDF
> >> a        24181.82589298836      0              72060
> >> his      4117.6785482652485     0              17203
> >> has      4027.8821644652635     0              16855
> >> from     3677.9947950267233     0              15488
> >> year     3400.440033295192      0              14391
> >> state    3359.4400672735646     0              14228
> >> said     3221.0516651300713     0              13676
> >> it       3180.321518546436      0              13513
> >> he       3005.353873868007      0              12810
> >> new      2771.180380204227      0              11862
> >> </snip>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene:
http://www.lucidimagination.com/search
