https://issues.apache.org/jira/browse/MAHOUT-433
On Mon, Jul 5, 2010 at 2:28 PM, Grant Ingersoll <[email protected]> wrote:

> OK, seems the problem is ClusterLabels was never updated when we switched
> over to WeightedVectorWritable and it also seems like somewhere in the
> equation of KMeans being run that we lost the NamedVector again, as the
> clusteredPoints directory does not contain NamedVectors, even though that is
> what I created the original points as when starting.
>
>
> On Mon, Jul 5, 2010 at 1:55 PM, Grant Ingersoll <[email protected]> wrote:
>
>> Hmmm, different field, more or less the same result, i.e. all labels are
>> the same for each vector [1]. I also included the Cluster dump [2]. I'm
>> suspecting a bug.
>>
>> [1]
>> Top labels for Cluster 129022 containing 19186 vectors
>> Term  LLR  In-ClusterDF  Out-ClusterDF
>> has  5836.5347257247195  0  16867
>> from  5328.54616727354  0  15499
>> year  4925.276801970322  0  14400
>> state  4866.91887763422  0  14240
>> new  4011.6858639516868  0  11867
>> after  3882.1740732807666  0  11503
>> first  3002.5827110484242  0  8998
>> two  2984.1892275922  0  8945
>> unit  2930.794111499563  0  8791
>> one  2686.95768492762  0  8085
>>
>> Top labels for Cluster 129119 containing 16043 vectors
>> Term  LLR  In-ClusterDF  Out-ClusterDF
>> has  4808.386086146813  0  16867
>> from  4390.346637147013  0  15499
>> year  4058.4180186586455  0  14400
>> state  4010.379176544491  0  14240
>> new  3306.234930681996  0  11867
>> after  3199.5810555517673  0  11503
>> first  2475.079962851014  0  8998
>> two  2459.926843432244  0  8945
>> unit  2415.9376569474116  0  8791
>> one  2215.042654468678  0  8085
>>
>> Top labels for Cluster 129191 containing 7770 vectors
>> Term  LLR  In-ClusterDF  Out-ClusterDF
>> has  2243.2657141932286  0  16867
>> from  2048.755412856117  0  15499
>> year  1894.2384706358425  0  14400
>> state  1871.8704557279125  0  14240
>> new  1543.8513879175298  0  11867
>> after  1494.1429192917421  0  11503
>> first  1156.303048826754  0  8998
>> two  1149.2339147529565  0  8945
>> unit  1128.711646862328  0  8791
>> one  1034.9745452422649  0  8085
>>
>> Top labels for Cluster 129302 containing 9426 vectors
>> Term  LLR  In-ClusterDF  Out-ClusterDF
>> has  2741.316972494591  0  16867
>> from  2503.501101480797  0  15499
>> year  2314.5996575923637  0  14400
>> state  2287.255346294027  0  14240
>> new  1886.2961270781234  0  11867
>> after  1825.5399498036131  0  11503
>> first  1412.654560342431  0  8998
>> two  1404.0158626483753  0  8945
>> unit  1378.9371921028942  0  8791
>> one  1264.391515379306  0  8085
>>
>> Top labels for Cluster 129360 containing 13092 vectors
>> Term  LLR  In-ClusterDF  Out-ClusterDF
>> has  3870.8181769265793  0  16867
>> from  3534.623348234687  0  15499
>> year  3267.633215776179  0  14400
>> state  3228.989259615075  0  14240
>> new  2662.4551618834957  0  11867
>> after  2576.628638952039  0  11503
>> first  1993.499155438505  0  8998
>> two  1981.3008509986103  0  8945
>> unit  1945.8889682726003  0  8791
>> one  1784.1570986662991  0  8085
>>
>> Top labels for Cluster 129371 containing 23944 vectors
>> Term  LLR  In-ClusterDF  Out-ClusterDF
>> has  7455.31941217836  0  16867
>> from  6805.274207816925  0  15499
>> year  6289.398677708115  0  14400
>> state  6214.757351316046  0  14240
>> new  5121.23683049297  0  11867
>> after  4955.695805796888  0  11503
>> first  3831.788851835765  0  8998
>> two  3808.2933898111805  0  8945
>> unit  3740.0891623105854  0  8791
>> one  3428.6551325367764  0  8085
>>
>> Top labels for Cluster 129373 containing 9885 vectors
>> Term  LLR  In-ClusterDF  Out-ClusterDF
>> has  2880.6778563517146  0  16867
>> from  2630.736483251676  0  15499
>> year  2432.208566541318  0  14400
>> state  2403.4711471684277  0  14240
>> new  1982.0948037123308  0  11867
>> after  1918.2465800205246  0  11503
>> first  1484.359997350257  0  8998
>> two  1475.282112147659  0  8945
>> unit  1448.9285028181039  0  8791
>> one  1328.560536378529  0  8085
>>
>> Top labels for Cluster 129377 containing 11303 vectors
>> Term  LLR  In-ClusterDF  Out-ClusterDF
>> has  3314.8890487886965  0  16867
>> from  3027.14497121796  0  15499
>> year  2798.608615776524  0  14400
>> state  2765.528720188886  0  14240
>> new  2280.5166378575377  0  11867
>> after  2207.0322705539875  0  11503
>> first  1707.7044410486706  0  8998
>> two  1697.2581536169164  0  8945
>> unit  1666.932174641639  0  8791
>> one  1528.4241032432765  0  8085
>>
>> Top labels for Cluster 129381 containing 11411 vectors
>> Term  LLR  In-ClusterDF  Out-ClusterDF
>> has  3348.190782570746  0  16867
>> from  3057.545994592365  0  15499
>> year  2826.7072093421593  0  14400
>> state  2793.2941474220715  0  14240
>> new  2303.4001871203072  0  11867
>> after  2229.176642407663  0  11503
>> first  1724.8293614634313  0  8998
>> two  1714.2781240069307  0  8945
>> unit  1683.6474849330261  0  8791
>> one  1543.7481994605623  0  8085
>>
>> Top labels for Cluster 129391 containing 7334 vectors
>> Term  LLR  In-ClusterDF  Out-ClusterDF
>> has  2113.35227333894  0  16867
>> from  1930.1305988361128  0  15499
>> year  1784.577833758667  0  14400
>> state  1763.5072347805835  0  14240
>> new  1454.5072316131555  0  11867
>> after  1407.6797917694785  0  11503
>> first  1089.4127462548204  0  8998
>> two  1082.7530186888762  0  8945
>> unit  1063.4192575318739  0  8791
>> one  975.1101242941804  0  8085
>>
>> [2]
>> :C-129022: [0:0.001, 000:0.003, 004:0.000, 0040:0.000, 0060:0.000,
>> 01:0.000, 0100:0.000, 0110:0.000,
>> Top Terms:
>> from => 0.022236135215980328
>> u => 0.01589135359475966
>> busi => 0.014789942880805335
>> bank => 0.014395075820558541
>> us => 0.01402954110138604
>> presid => 0.01341952961319183
>> month => 0.012118726267037198
>> about => 0.011986047971260612
>> compani => 0.011201454374207618
>> obama => 0.01105482429336391
>> :C-129119: [0:0.001, 00:0.000, 000:0.003, 03:0.000, 04:0.000, 05:0.000,
>> 0656:0.000, 07:0.000, 09:0.00
>> Top Terms:
>> citi => 0.04119064757467011
>> former => 0.030966538725529232
>> home => 0.029642735534519644
>> player => 0.02879703136878369
>> soccer => 0.01847372541986708
>> has => 0.015236681440174855
>> mark => 0.015185164518720528
>> new => 0.01266468154720074
>> polic => 0.01253454821409647
>> world => 0.011803315296178046
>> :C-129191: [0:0.013, 00:0.002, 000:0.004, 000000:0.000, 001:0.000,
>> 0011:0.000, 0022:0.000, 003:0.000,
>> Top Terms:
>> 4 => 0.027636996760550075
>> 3 => 0.026093296145846434
>> 1 => 0.02570191540464146
>> 5 => 0.024807189589701305
>> 2 => 0.023669513631826157
>> were => 0.021134415210709086
>> sunday => 0.017928504766147838
>> play => 0.017243683740808733
>> through => 0.017133336974828554
>> game => 0.017027790192043733
>> :C-129302: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 002:0.000, 008:0.000,
>> 01:0.000, 011:0.000, 0112:
>> Top Terms:
>> new => 0.039501149799390206
>> peopl => 0.01933397797740685
>> world => 0.017478792605253438
>> could => 0.013495142418778704
>> has => 0.012987326502897916
>> more => 0.012585724039194569
>> from => 0.012242682917236177
>> face => 0.0117046220661272
>> leader => 0.011579584625370691
>> presid => 0.011192085113854965
>> :C-129360: [0:0.000, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000,
>> 005:0.000, 007:0.000, 008:
>> Top Terms:
>> state => 0.044732720259456946
>> unit => 0.032493582810588666
>> year => 0.025651340609304542
>> san => 0.025617706557963606
>> after => 0.022019046306438913
>> francisco => 0.020771004252363168
>> california => 0.01847124801606253
>> day => 0.015514125170527842
>> wednesday => 0.014587851421509652
>> citi => 0.012973538756014369
>> :C-129371: [0:0.002, 00:0.000, 000:0.001, 01:0.000, 010:0.000, 0134:0.000,
>> 016:0.000, 02:0.000, 03:0.
>> Top Terms:
>> game => 0.04311022785679375
>> has => 0.03059922226267673
>> all => 0.027605073346921877
>> leagu => 0.0267627245855276
>> star => 0.02206632764439995
>> final => 0.020017765794918686
>> season => 0.01534931562714024
>> start => 0.01450896856938099
>> week => 0.014407234069110549
>> nation => 0.01429746391305699
>> :C-129373: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 01:0.000, 012:0.000,
>> 016:0.000, 03:0.000, 034:0.
>> Top Terms:
>> coach => 0.05209277512761816
>> team => 0.031773971685165554
>> charg => 0.024246280249912454
>> from => 0.02093643936347752
>> has => 0.02057631329905952
>> week => 0.016848920922797363
>> last => 0.01674320150844955
>> program => 0.016023081209070564
>> former => 0.015872337289314063
>> after => 0.01341825692502786
>> :C-129377: [0:0.002, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000,
>> 006:0.000, 0065:0.000, 007
>> Top Terms:
>> been => 0.03757994091979662
>> time => 0.03591307497544333
>> first => 0.03422461795380875
>> has => 0.029800513863644906
>> feder => 0.027382680342986195
>> monday => 0.022174840523045594
>> sinc => 0.02185219249613946
>> year => 0.01933420097135394
>> from => 0.01162537888358458
>> state => 0.009756869426688311
>> :C-129381: [0:0.004, 00:0.000, 000:0.002, 00000000235:0.000, 001:0.000,
>> 0011:0.000, 002:0.000, 0051:0
>> Top Terms:
>> win => 0.03267669747239372
>> one => 0.031009191445456212
>> second => 0.028066582472705007
>> three => 0.026147346665631184
>> out => 0.0226123748207931
>> shot => 0.020446190395276405
>> last => 0.019624841184867056
>> night => 0.019103407305052604
>> over => 0.017376642133669604
>> year => 0.016475201865715022
>> :C-129391: [0:0.003, 00:0.000, 000:0.001, 002:0.000, 01:0.000, 0112:0.000,
>> 0123:0.000, 02:0.000, 0213
>> Top Terms:
>> championship => 0.035449579372280104
>> run => 0.026446073370591447
>> art => 0.02489330236372834
>> open => 0.02282619503375418
>> place => 0.022410914360311056
>> grand => 0.0169734705340118
>> reuter => 0.015895311339829302
>> 6 => 0.015700075983436933
>> continu => 0.015418929721703813
>> slam => 0.012102435338420274
>>
>>
>> -Grant
>> On Jul 5, 2010, at 1:07 PM, Ted Dunning wrote:
>>
>> > Can't say just off-hand.
>> >
>> > What is the data?
>> >
>> > On Mon, Jul 5, 2010 at 8:20 AM, Grant Ingersoll <[email protected]> wrote:
>> >
>> >> I'm running ClusterLabels and it seems to be outputting the same values for
>> >> every centroid [1]. When I run the cluster dumper, the top terms are fairly
>> >> different for those same vectors.
>> >>
>> >> Have I hit a vagary of LLR or is this a bug?
>> >>
>> >>
>> >> Thanks,
>> >> Grant
>> >>
>> >>
>> >> [1]
>> >> <snip>
>> >> Top labels for Cluster 129062 containing 22710 vectors
>> >> Term  LLR  In-ClusterDF  Out-ClusterDF
>> >> a  43269.00830466254  0  72060
>> >> his  7185.503760070074  0  17203
>> >> has  7028.243643655442  0  16855
>> >> from  6415.739411605988  0  15488
>> >> year  5930.141497239005  0  14391
>> >> state  5858.43069797568  0  14228
>> >> said  5616.422720833216  0  13676
>> >> it  5545.207108973991  0  13513
>> >> he  5239.340392438695  0  12810
>> >> new  4830.124521905556  0  11862
>> >>
>> >> Top labels for Cluster 129145 containing 11188 vectors
>> >> Term  LLR  In-ClusterDF  Out-ClusterDF
>> >> a  19576.26998734614  0  72060
>> >> his  3352.5135342599824  0  17203
>> >> has  3279.466228939127  0  16855
>> >> from  2994.8128935270943  0  15488
>> >> year  2768.974903047085  0  14391
>> >> state  2735.612128134351  0  14228
>> >> said  2622.997358441353  0  13676
>> >> it  2589.8515553446487  0  13513
>> >> he  2447.4579147226177  0  12810
>> >> new  2256.8640938592143  0  11862
>> >>
>> >> Top labels for Cluster 129201 containing 13040 vectors
>> >> Term  LLR  In-ClusterDF  Out-ClusterDF
>> >> a  23110.173012922285  0  72060
>> >> his  3940.4691014224663  0  17203
>> >> has  3854.554399965331  0  16855
>> >> from  3519.784154796507  0  15488
>> >> year  3254.2127395244315  0  14391
>> >> state  3214.9822960514575  0  14228
>> >> said  3082.565408431459  0  13676
>> >> it  3043.5924300444312  0  13513
>> >> he  2876.171367166564  0  12810
>> >> new  2652.0934832417406  0  11862
>> >>
>> >> Top labels for Cluster 129211 containing 14053 vectors
>> >> Term  LLR  In-ClusterDF  Out-ClusterDF
>> >> a  25083.46391701023  0  72060
>> >> his  4266.378291217145  0  17203
>> >> has  4173.323467798065  0  16855
>> >> from  3810.7467373879626  0  15488
>> >> year  3523.1337431534193  0  14391
>> >> state  3480.648573280778  0  14228
>> >> said  3337.2482196930796  0  13676
>> >> it  3295.0432900944725  0  13513
>> >> he  3113.741967030335  0  12810
>> >> new  2871.0957860480994  0  11862
>> >>
>> >> Top labels for Cluster 129242 containing 12861 vectors
>> >> Term  LLR  In-ClusterDF  Out-ClusterDF
>> >> a  22764.503256496973  0  72060
>> >> his  3883.2002838114277  0  17203
>> >> has  3798.5396822127514  0  16855
>> >> from  3468.6536546614952  0  15488
>> >> year  3206.954131908249  0  14391
>> >> state  3168.2954448102973  0  14228
>> >> said  3037.808057511691  0  13676
>> >> it  2999.402857856825  0  13513
>> >> he  2834.4202939094976  0  12810
>> >> new  2613.604658874683  0  11862
>> >>
>> >> Top labels for Cluster 129245 containing 6443 vectors
>> >> Term  LLR  In-ClusterDF  Out-ClusterDF
>> >> a  10925.268199045677  0  72060
>> >> his  1890.511348863598  0  17203
>> >> has  1849.385320336558  0  16855
>> >> from  1689.0946326381527  0  15488
>> >> year  1561.8904545903206  0  14391
>> >> state  1543.096286157146  0  14228
>> >> said  1479.652662154287  0  13676
>> >> it  1460.9780013803393  0  13513
>> >> he  1380.745082413312  0  12810
>> >> new  1273.3357145632617  0  11862
>> >>
>> >> Top labels for Cluster 129255 containing 11390 vectors
>> >> Term  LLR  In-ClusterDF  Out-ClusterDF
>> >> a  19957.211259535048  0  72060
>> >> his  3416.1555761522613  0  17203
>> >> has  3341.7163103362545  0  16855
>> >> from  3051.6410844950005  0  15488
>> >> year  2821.504116652999  0  14391
>> >> state  2787.5064550531097  0  14228
>> >> said  2672.7490201727487  0  13676
>> >> it  2638.972676954698  0  13513
>> >> he  2493.870809029322  0  12810
>> >> new  2299.653438703157  0  11862
>> >>
>> >> Top labels for Cluster 129265 containing 9461 vectors
>> >> Term  LLR  In-ClusterDF  Out-ClusterDF
>> >> a  16362.85457371641  0  72060
>> >> his  2813.167819214519  0  17203
>> >> has  2751.908798408229  0  16855
>> >> from  2513.176188033074  0  15488
>> >> year  2323.752471229993  0  14391
>> >> state  2295.767774611246  0  14228
>> >> said  2201.3039346230216  0  13676
>> >> it  2173.4997256915085  0  13513
>> >> he  2054.0495802331716  0  12810
>> >> new  1894.1558320098557  0  11862
>> >>
>> >> Top labels for Cluster 129279 containing 14559 vectors
>> >> Term  LLR  In-ClusterDF  Out-ClusterDF
>> >> a  26080.197364640888  0  72060
>> >> his  4430.338072712999  0  17203
>> >> has  4333.689091425855  0  16855
>> >> from  3957.116204748396  0  15488
>> >> year  3658.40981121175  0  14391
>> >> state  3614.286633652635  0  14228
>> >> said  3465.358771919273  0  13676
>> >> it  3421.527382406406  0  13513
>> >> he  3233.2411222746596  0  12810
>> >> new  2981.251407010015  0  11862
>> >>
>> >> Top labels for Cluster 129290 containing 13592 vectors
>> >> Term  LLR  In-ClusterDF  Out-ClusterDF
>> >> a  24181.82589298836  0  72060
>> >> his  4117.6785482652485  0  17203
>> >> has  4027.8821644652635  0  16855
>> >> from  3677.9947950267233  0  15488
>> >> year  3400.440033295192  0  14391
>> >> state  3359.4400672735646  0  14228
>> >> said  3221.0516651300713  0  13676
>> >> it  3180.321518546436  0  13513
>> >> he  3005.353873868007  0  12810
>> >> new  2771.180380204227  0  11862
>> >> </snip>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>
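A note on why the label tables quoted above look identical: In-ClusterDF is 0 for every term in every cluster. With a zero in-cluster count, the 2x2 log-likelihood ratio grows with Out-ClusterDF alone, so every cluster ends up with the same ranking and only the magnitudes scale with cluster size. The sketch below illustrates that with the standard entropy form of the LLR statistic. It is not the ClusterLabels code; `llr`, `rank_terms`, and `CORPUS_SIZE` are hypothetical names, the corpus size is an assumption (the sum of the ten cluster sizes listed in the 1:55 PM message), and the in-cluster count of 5000 in the last call is made up purely for contrast.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: in-cluster docs containing the term
    k12: out-of-cluster docs containing the term
    k21: in-cluster docs without the term
    k22: out-of-cluster docs without the term
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def rank_terms(in_df, out_df, cluster_size, corpus_size):
    """Return terms sorted by descending LLR (hypothetical helper)."""
    scores = {}
    for term, d_out in out_df.items():
        d_in = in_df.get(term, 0)
        scores[term] = llr(
            d_in,
            d_out,
            cluster_size - d_in,
            corpus_size - cluster_size - d_out,
        )
    return sorted(scores, key=scores.get, reverse=True)


# Out-ClusterDF values copied from the tables in the thread.
OUT_DF = {
    "has": 16867, "from": 15499, "year": 14400, "state": 14240,
    "new": 11867, "after": 11503, "first": 8998, "two": 8945,
    "unit": 8791, "one": 8085,
}
# Assumption: corpus size taken as the sum of the ten listed cluster sizes.
CORPUS_SIZE = 129394

# In-cluster DF is 0 everywhere, as in the thread: both clusters rank alike.
print(rank_terms({}, OUT_DF, 19186, CORPUS_SIZE))
print(rank_terms({}, OUT_DF, 7334, CORPUS_SIZE))

# A made-up non-zero in-cluster count is enough to change the ranking.
print(rank_terms({"one": 5000}, OUT_DF, 7334, CORPUS_SIZE))
```

If the in-cluster counts were populated correctly (for example, once the vectors in clusteredPoints can be matched back to their clusters), the first two rankings would be expected to differ from each other, as the cluster dump's top terms do.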
