OK, it seems the problem is that ClusterLabels was never updated when we
switched over to WeightedVectorWritable. It also seems that somewhere in the
KMeans run we lost the NamedVector again: the clusteredPoints directory does
not contain NamedVectors, even though that is what I created the original
points as.
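
For what it's worth, here is a minimal sketch of why a zero In-ClusterDF
gives the same label list for every cluster. It is self-contained Java, not
the ClusterLabels code: the class LlrSketch, its method names, and the layout
of the 2x2 table are my assumptions, and the corpus size is simply the sum of
the ten cluster sizes in the second quoted listing. It uses the entropy form
of the log-likelihood ratio.

```java
import java.util.Locale;

/** Sketch only: entropy form of the log-likelihood ratio on a 2x2 table. */
final class LlrSketch {

    // x * ln(x), with the convention that 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a set of counts: xLogX(sum) - sum(xLogX(count)).
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            sum += c;
            parts += xLogX(c);
        }
        return xLogX(sum) - parts;
    }

    // Log-likelihood ratio for the table [[k11, k12], [k21, k22]].
    static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
    }

    // Assumed table layout (hypothetical, not taken from ClusterLabels):
    // rows are "doc has term" / "doc lacks term",
    // columns are "doc in cluster" / "doc outside cluster".
    static double labelScore(long inDf, long outDf, long clusterSize, long corpusSize) {
        long k11 = inDf;                              // has term, in cluster
        long k12 = outDf;                             // has term, outside cluster
        long k21 = clusterSize - inDf;                // lacks term, in cluster
        long k22 = corpusSize - clusterSize - outDf;  // lacks term, outside cluster
        return llr(k11, k12, k21, k22);
    }

    public static void main(String[] args) {
        long corpusSize = 129394L; // assumption: sum of the ten listed cluster sizes
        String[] terms = {"has", "from", "year"};
        long[] outDf = {16867L, 15499L, 14400L};
        long[] clusterSizes = {19186L, 7334L};
        for (long size : clusterSizes) {
            System.out.println("cluster size " + size);
            for (int i = 0; i < terms.length; i++) {
                // In-ClusterDF is 0 for every term, as in the quoted output.
                double score = labelScore(0L, outDf[i], size, corpusSize);
                System.out.printf(Locale.ROOT, "  %-6s %.4f%n", terms[i], score);
            }
        }
    }
}
```

With inDF stuck at 0, the score for a given cluster grows with Out-ClusterDF
alone, so every cluster ranks the terms in the same order and only the
magnitudes change with cluster size. That matches the output quoted below and
points at the in-cluster counts never being populated.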

On Mon, Jul 5, 2010 at 1:55 PM, Grant Ingersoll <[email protected]> wrote:

> Hmmm, different field, more or less the same result, i.e. all labels are
> the same for each vector [1].  I also included the Cluster dump [2].  I'm
> suspecting a bug.
>
> [1]
> Top labels for Cluster 129022 containing 19186 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             5836.5347257247195              0               16867
> from            5328.54616727354                0               15499
> year            4925.276801970322               0               14400
> state           4866.91887763422                0               14240
> new             4011.6858639516868              0               11867
> after           3882.1740732807666              0               11503
> first           3002.5827110484242              0               8998
> two             2984.1892275922                 0               8945
> unit            2930.794111499563               0               8791
> one             2686.95768492762                0               8085
>
> Top labels for Cluster 129119 containing 16043 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             4808.386086146813               0               16867
> from            4390.346637147013               0               15499
> year            4058.4180186586455              0               14400
> state           4010.379176544491               0               14240
> new             3306.234930681996               0               11867
> after           3199.5810555517673              0               11503
> first           2475.079962851014               0               8998
> two             2459.926843432244               0               8945
> unit            2415.9376569474116              0               8791
> one             2215.042654468678               0               8085
>
> Top labels for Cluster 129191 containing 7770 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             2243.2657141932286              0               16867
> from            2048.755412856117               0               15499
> year            1894.2384706358425              0               14400
> state           1871.8704557279125              0               14240
> new             1543.8513879175298              0               11867
> after           1494.1429192917421              0               11503
> first           1156.303048826754               0               8998
> two             1149.2339147529565              0               8945
> unit            1128.711646862328               0               8791
> one             1034.9745452422649              0               8085
>
> Top labels for Cluster 129302 containing 9426 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             2741.316972494591               0               16867
> from            2503.501101480797               0               15499
> year            2314.5996575923637              0               14400
> state           2287.255346294027               0               14240
> new             1886.2961270781234              0               11867
> after           1825.5399498036131              0               11503
> first           1412.654560342431               0               8998
> two             1404.0158626483753              0               8945
> unit            1378.9371921028942              0               8791
> one             1264.391515379306               0               8085
>
> Top labels for Cluster 129360 containing 13092 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             3870.8181769265793              0               16867
> from            3534.623348234687               0               15499
> year            3267.633215776179               0               14400
> state           3228.989259615075               0               14240
> new             2662.4551618834957              0               11867
> after           2576.628638952039               0               11503
> first           1993.499155438505               0               8998
> two             1981.3008509986103              0               8945
> unit            1945.8889682726003              0               8791
> one             1784.1570986662991              0               8085
>
> Top labels for Cluster 129371 containing 23944 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             7455.31941217836                0               16867
> from            6805.274207816925               0               15499
> year            6289.398677708115               0               14400
> state           6214.757351316046               0               14240
> new             5121.23683049297                0               11867
> after           4955.695805796888               0               11503
> first           3831.788851835765               0               8998
> two             3808.2933898111805              0               8945
> unit            3740.0891623105854              0               8791
> one             3428.6551325367764              0               8085
>
> Top labels for Cluster 129373 containing 9885 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             2880.6778563517146              0               16867
> from            2630.736483251676               0               15499
> year            2432.208566541318               0               14400
> state           2403.4711471684277              0               14240
> new             1982.0948037123308              0               11867
> after           1918.2465800205246              0               11503
> first           1484.359997350257               0               8998
> two             1475.282112147659               0               8945
> unit            1448.9285028181039              0               8791
> one             1328.560536378529               0               8085
>
> Top labels for Cluster 129377 containing 11303 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             3314.8890487886965              0               16867
> from            3027.14497121796                0               15499
> year            2798.608615776524               0               14400
> state           2765.528720188886               0               14240
> new             2280.5166378575377              0               11867
> after           2207.0322705539875              0               11503
> first           1707.7044410486706              0               8998
> two             1697.2581536169164              0               8945
> unit            1666.932174641639               0               8791
> one             1528.4241032432765              0               8085
>
> Top labels for Cluster 129381 containing 11411 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             3348.190782570746               0               16867
> from            3057.545994592365               0               15499
> year            2826.7072093421593              0               14400
> state           2793.2941474220715              0               14240
> new             2303.4001871203072              0               11867
> after           2229.176642407663               0               11503
> first           1724.8293614634313              0               8998
> two             1714.2781240069307              0               8945
> unit            1683.6474849330261              0               8791
> one             1543.7481994605623              0               8085
>
> Top labels for Cluster 129391 containing 7334 vectors
> Term             LLR             In-ClusterDF            Out-ClusterDF
> has             2113.35227333894                0               16867
> from            1930.1305988361128              0               15499
> year            1784.577833758667               0               14400
> state           1763.5072347805835              0               14240
> new             1454.5072316131555              0               11867
> after           1407.6797917694785              0               11503
> first           1089.4127462548204              0               8998
> two             1082.7530186888762              0               8945
> unit            1063.4192575318739              0               8791
> one             975.1101242941804               0               8085
>
> [2]
> :C-129022: [0:0.001, 000:0.003, 004:0.000, 0040:0.000, 0060:0.000,
> 01:0.000, 0100:0.000, 0110:0.000,
>        Top Terms:
>                from          => 0.022236135215980328
>                u             => 0.01589135359475966
>                busi          => 0.014789942880805335
>                bank          => 0.014395075820558541
>                us            => 0.01402954110138604
>                presid        => 0.01341952961319183
>                month         => 0.012118726267037198
>                about         => 0.011986047971260612
>                compani       => 0.011201454374207618
>                obama         => 0.01105482429336391
> :C-129119: [0:0.001, 00:0.000, 000:0.003, 03:0.000, 04:0.000, 05:0.000,
> 0656:0.000, 07:0.000, 09:0.00
>        Top Terms:
>                citi          => 0.04119064757467011
>                former        => 0.030966538725529232
>                home          => 0.029642735534519644
>                player        => 0.02879703136878369
>                soccer        => 0.01847372541986708
>                has           => 0.015236681440174855
>                mark          => 0.015185164518720528
>                new           => 0.01266468154720074
>                polic         => 0.01253454821409647
>                world         => 0.011803315296178046
> :C-129191: [0:0.013, 00:0.002, 000:0.004, 000000:0.000, 001:0.000,
> 0011:0.000, 0022:0.000, 003:0.000,
>        Top Terms:
>                4             => 0.027636996760550075
>                3             => 0.026093296145846434
>                1             => 0.02570191540464146
>                5             => 0.024807189589701305
>                2             => 0.023669513631826157
>                were          => 0.021134415210709086
>                sunday        => 0.017928504766147838
>                play          => 0.017243683740808733
>                through       => 0.017133336974828554
>                game          => 0.017027790192043733
> :C-129302: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 002:0.000, 008:0.000,
> 01:0.000, 011:0.000, 0112:
>        Top Terms:
>                new           => 0.039501149799390206
>                peopl         => 0.01933397797740685
>                world         => 0.017478792605253438
>                could         => 0.013495142418778704
>                has           => 0.012987326502897916
>                more          => 0.012585724039194569
>                from          => 0.012242682917236177
>                face          => 0.0117046220661272
>                leader        => 0.011579584625370691
>                presid        => 0.011192085113854965
> :C-129360: [0:0.000, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000,
> 005:0.000, 007:0.000, 008:
>        Top Terms:
>                state         => 0.044732720259456946
>                unit          => 0.032493582810588666
>                year          => 0.025651340609304542
>                san           => 0.025617706557963606
>                after         => 0.022019046306438913
>                francisco     => 0.020771004252363168
>                california    => 0.01847124801606253
>                day           => 0.015514125170527842
>                wednesday     => 0.014587851421509652
>                citi          => 0.012973538756014369
> :C-129371: [0:0.002, 00:0.000, 000:0.001, 01:0.000, 010:0.000, 0134:0.000,
> 016:0.000, 02:0.000, 03:0.
>        Top Terms:
>                game          => 0.04311022785679375
>                has           => 0.03059922226267673
>                all           => 0.027605073346921877
>                leagu         => 0.0267627245855276
>                star          => 0.02206632764439995
>                final         => 0.020017765794918686
>                season        => 0.01534931562714024
>                start         => 0.01450896856938099
>                week          => 0.014407234069110549
>                nation        => 0.01429746391305699
> :C-129373: [0:0.000, 00:0.000, 000:0.003, 001:0.000, 01:0.000, 012:0.000,
> 016:0.000, 03:0.000, 034:0.
>        Top Terms:
>                coach         => 0.05209277512761816
>                team          => 0.031773971685165554
>                charg         => 0.024246280249912454
>                from          => 0.02093643936347752
>                has           => 0.02057631329905952
>                week          => 0.016848920922797363
>                last          => 0.01674320150844955
>                program       => 0.016023081209070564
>                former        => 0.015872337289314063
>                after         => 0.01341825692502786
> :C-129377: [0:0.002, 00:0.000, 000:0.004, 001:0.000, 002:0.000, 003:0.000,
> 006:0.000, 0065:0.000, 007
>        Top Terms:
>                been          => 0.03757994091979662
>                time          => 0.03591307497544333
>                first         => 0.03422461795380875
>                has           => 0.029800513863644906
>                feder         => 0.027382680342986195
>                monday        => 0.022174840523045594
>                sinc          => 0.02185219249613946
>                year          => 0.01933420097135394
>                from          => 0.01162537888358458
>                state         => 0.009756869426688311
> :C-129381: [0:0.004, 00:0.000, 000:0.002, 00000000235:0.000, 001:0.000,
> 0011:0.000, 002:0.000, 0051:0
>        Top Terms:
>                win           => 0.03267669747239372
>                one           => 0.031009191445456212
>                second        => 0.028066582472705007
>                three         => 0.026147346665631184
>                out           => 0.0226123748207931
>                shot          => 0.020446190395276405
>                last          => 0.019624841184867056
>                night         => 0.019103407305052604
>                over          => 0.017376642133669604
>                year          => 0.016475201865715022
> :C-129391: [0:0.003, 00:0.000, 000:0.001, 002:0.000, 01:0.000, 0112:0.000,
> 0123:0.000, 02:0.000, 0213
>        Top Terms:
>                championship  => 0.035449579372280104
>                run           => 0.026446073370591447
>                art           => 0.02489330236372834
>                open          => 0.02282619503375418
>                place         => 0.022410914360311056
>                grand         => 0.0169734705340118
>                reuter        => 0.015895311339829302
>                6             => 0.015700075983436933
>                continu       => 0.015418929721703813
>                slam          => 0.012102435338420274
>
>
> -Grant
> On Jul 5, 2010, at 1:07 PM, Ted Dunning wrote:
>
> > Can't say just off-hand.
> >
> > What is the data?
> >
> > On Mon, Jul 5, 2010 at 8:20 AM, Grant Ingersoll <[email protected]>
> > wrote:
> >
> >> I'm running ClusterLabels and it seems to be outputting the same values
> >> for every centroid [1].  When I run the cluster dumper, the top terms are
> >> fairly different for those same vectors.
> >>
> >> Have I hit a vagary of LLR or is this a bug?
> >>
> >>
> >> Thanks,
> >> Grant
> >>
> >>
> >> [1]
> >> <snip>
> >> Top labels for Cluster 129062 containing 22710 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               43269.00830466254               0               72060
> >> his             7185.503760070074               0               17203
> >> has             7028.243643655442               0               16855
> >> from            6415.739411605988               0               15488
> >> year            5930.141497239005               0               14391
> >> state           5858.43069797568                0               14228
> >> said            5616.422720833216               0               13676
> >> it              5545.207108973991               0               13513
> >> he              5239.340392438695               0               12810
> >> new             4830.124521905556               0               11862
> >>
> >> Top labels for Cluster 129145 containing 11188 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               19576.26998734614               0               72060
> >> his             3352.5135342599824              0               17203
> >> has             3279.466228939127               0               16855
> >> from            2994.8128935270943              0               15488
> >> year            2768.974903047085               0               14391
> >> state           2735.612128134351               0               14228
> >> said            2622.997358441353               0               13676
> >> it              2589.8515553446487              0               13513
> >> he              2447.4579147226177              0               12810
> >> new             2256.8640938592143              0               11862
> >>
> >> Top labels for Cluster 129201 containing 13040 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               23110.173012922285              0               72060
> >> his             3940.4691014224663              0               17203
> >> has             3854.554399965331               0               16855
> >> from            3519.784154796507               0               15488
> >> year            3254.2127395244315              0               14391
> >> state           3214.9822960514575              0               14228
> >> said            3082.565408431459               0               13676
> >> it              3043.5924300444312              0               13513
> >> he              2876.171367166564               0               12810
> >> new             2652.0934832417406              0               11862
> >>
> >> Top labels for Cluster 129211 containing 14053 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               25083.46391701023               0               72060
> >> his             4266.378291217145               0               17203
> >> has             4173.323467798065               0               16855
> >> from            3810.7467373879626              0               15488
> >> year            3523.1337431534193              0               14391
> >> state           3480.648573280778               0               14228
> >> said            3337.2482196930796              0               13676
> >> it              3295.0432900944725              0               13513
> >> he              3113.741967030335               0               12810
> >> new             2871.0957860480994              0               11862
> >>
> >> Top labels for Cluster 129242 containing 12861 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               22764.503256496973              0               72060
> >> his             3883.2002838114277              0               17203
> >> has             3798.5396822127514              0               16855
> >> from            3468.6536546614952              0               15488
> >> year            3206.954131908249               0               14391
> >> state           3168.2954448102973              0               14228
> >> said            3037.808057511691               0               13676
> >> it              2999.402857856825               0               13513
> >> he              2834.4202939094976              0               12810
> >> new             2613.604658874683               0               11862
> >>
> >> Top labels for Cluster 129245 containing 6443 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               10925.268199045677              0               72060
> >> his             1890.511348863598               0               17203
> >> has             1849.385320336558               0               16855
> >> from            1689.0946326381527              0               15488
> >> year            1561.8904545903206              0               14391
> >> state           1543.096286157146               0               14228
> >> said            1479.652662154287               0               13676
> >> it              1460.9780013803393              0               13513
> >> he              1380.745082413312               0               12810
> >> new             1273.3357145632617              0               11862
> >>
> >> Top labels for Cluster 129255 containing 11390 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               19957.211259535048              0               72060
> >> his             3416.1555761522613              0               17203
> >> has             3341.7163103362545              0               16855
> >> from            3051.6410844950005              0               15488
> >> year            2821.504116652999               0               14391
> >> state           2787.5064550531097              0               14228
> >> said            2672.7490201727487              0               13676
> >> it              2638.972676954698               0               13513
> >> he              2493.870809029322               0               12810
> >> new             2299.653438703157               0               11862
> >>
> >> Top labels for Cluster 129265 containing 9461 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               16362.85457371641               0               72060
> >> his             2813.167819214519               0               17203
> >> has             2751.908798408229               0               16855
> >> from            2513.176188033074               0               15488
> >> year            2323.752471229993               0               14391
> >> state           2295.767774611246               0               14228
> >> said            2201.3039346230216              0               13676
> >> it              2173.4997256915085              0               13513
> >> he              2054.0495802331716              0               12810
> >> new             1894.1558320098557              0               11862
> >>
> >> Top labels for Cluster 129279 containing 14559 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               26080.197364640888              0               72060
> >> his             4430.338072712999               0               17203
> >> has             4333.689091425855               0               16855
> >> from            3957.116204748396               0               15488
> >> year            3658.40981121175                0               14391
> >> state           3614.286633652635               0               14228
> >> said            3465.358771919273               0               13676
> >> it              3421.527382406406               0               13513
> >> he              3233.2411222746596              0               12810
> >> new             2981.251407010015               0               11862
> >>
> >> Top labels for Cluster 129290 containing 13592 vectors
> >> Term             LLR             In-ClusterDF            Out-ClusterDF
> >> a               24181.82589298836               0               72060
> >> his             4117.6785482652485              0               17203
> >> has             4027.8821644652635              0               16855
> >> from            3677.9947950267233              0               15488
> >> year            3400.440033295192               0               14391
> >> state           3359.4400672735646              0               14228
> >> said            3221.0516651300713              0               13676
> >> it              3180.321518546436               0               13513
> >> he              3005.353873868007               0               12810
> >> new             2771.180380204227               0               11862
> >> </snip>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
