On Fri, Dec 28, 2012 at 12:35 AM, Dan Filimon <[email protected]> wrote:
> I have a couple of questions:
> - how did you pick 1000 as the dimension of the vectors?
>
Out of nowhere. Partly motivated by a desire to be able to pull the data
into R.
> - what is spoking behavior? is it that there seem to be some lines
> going through the origin that points tend to be on?
>
Yes.
> - when you say you built a multinomial model, how did you see strong
> signals? I'm not sure how you used it actually. :)
>
> library(nnet)
> m=multinom(group ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x13+x14+x15, x0)
> m
Call:
multinom(formula = group ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 +
x8 + x9 + x10 + x11 + x12 + x13 + x14 + x15, data = x0)
Coefficients:
(Intercept) x1 x2 x3
comp.graphics 0.44074296 1.01907543 -1.8568888 -9.1031682
comp.os.ms-windows.misc 0.45704221 -6.54596820 -0.5043770 -1.1127980
comp.sys.ibm.pc.hardware 0.29866502 -5.55952654 -0.8972155 -9.0253191
comp.sys.mac.hardware 0.45269879 -3.32835314 -5.5403978 -5.1908325
comp.windows.x 0.63847150 -4.40205097 -2.4856457 -5.2730532
misc.forsale 0.59012615 4.19061796 -3.8970935 -5.1713703
rec.autos 0.34494436 -2.60865621 -0.6769143 -3.8443432
rec.motorcycles 0.55637401 -1.86907459 -2.6722765 3.8598477
rec.sport.baseball 0.64776087 1.85164391 -5.5293810 -0.6844863
rec.sport.hockey 0.26954741 -0.06203175 -0.3972151 8.4481839
sci.crypt 0.07929986 3.60519397 2.1716362 -5.4979043
sci.electronics 0.31714050 -1.33324287 0.7805920 -4.8241339
sci.med 0.29779539 4.31605496 -0.8727921 2.9589061
sci.space 0.20997964 2.39444885 1.9581433 -1.1467414
soc.religion.christian -0.02225500 1.11565767 3.6428443 0.7498387
talk.politics.guns 0.11895828 5.39121244 -0.1921129 -0.9881931
talk.politics.mideast 0.03808929 -0.84748051 2.0709237 0.6018028
talk.politics.misc -0.21078137 1.37337160 4.5629442 0.6928847
talk.religion.misc -0.36185050 0.03655827 0.5980048 0.5054213
x4 x5 x6 x7 x8
comp.graphics 9.033334 7.0192748 0.16837265 0.2187211 -3.6491393
comp.os.ms-windows.misc 7.105049 0.2416844 6.47463665 2.6865188 0.9813021
comp.sys.ibm.pc.hardware 11.208601 5.1641788 6.89008112 6.8821034 -1.1563185
comp.sys.mac.hardware 10.960475 -1.9092041 2.40371401 -1.0299831 -0.3979949
comp.windows.x 7.831094 0.1186285 0.23514096 -1.6337241 -3.9838201
misc.forsale 5.769901 6.7159812 -1.93968915 2.6798810 1.1244947
rec.autos 3.865056 0.1188188 6.30501368 1.1181748 -4.0945192
rec.motorcycles 4.370033 1.3509225 -2.51137387 1.7889965 -0.2231039
rec.sport.baseball 4.905747 1.7073982 6.01976440 -2.0471183 -4.9937091
rec.sport.hockey 4.575955 -0.0869318 -1.85252683 2.7528117 -3.3982038
sci.crypt 8.485397 3.4533590 2.37158269 -0.7041111 -6.4355892
sci.electronics 8.807797 -1.1365877 1.01298828 2.3025815 -3.7304551
sci.med 4.251672 1.6995090 3.30625008 -0.8523060 -2.3018664
sci.space 9.040464 5.9714280 2.79637661 0.8799763 -5.8195717
soc.religion.christian 6.907276 0.4557410 -0.98188302 -2.1058969 -4.5960733
talk.politics.guns 7.149903 -2.9766376 1.87073733 4.8228957 -3.6893940
talk.politics.mideast 6.050361 1.9889525 -3.17168544 -2.3966889 8.1085935
talk.politics.misc 8.555592 1.4756449 -0.47450379 3.8521708 -3.4629278
talk.religion.misc 9.774690 4.1712967 0.03906819 1.2280719 -6.7910743
x9 x10 x11 x12
comp.graphics -0.39312040 -3.2572069 -7.4439559 -5.5429428
comp.os.ms-windows.misc 4.09046505 -6.4363098 -6.2324448 -6.1287834
comp.sys.ibm.pc.hardware 0.02015251 -1.9149433 -3.6751577 -5.2145510
comp.sys.mac.hardware 3.81277265 -9.4835151 -5.6315064 -2.3164148
comp.windows.x 3.33875226 -12.5747792 -9.4202176 -3.1903203
misc.forsale 0.08603018 -6.1517657 -5.6218740 -2.2750419
rec.autos 0.88637347 -3.8640771 -3.7595583 -2.1053560
rec.motorcycles -2.31446575 -11.3232215 -4.9337654 -3.2533249
rec.sport.baseball 0.47952604 -4.4506352 -3.7291590 -13.3192078
rec.sport.hockey 1.56687927 -3.4629955 0.8472809 -5.4775734
sci.crypt 2.56905183 -6.1068418 0.8769075 -1.4013213
sci.electronics -2.93202536 1.9462835 -4.2089267 -2.1560806
sci.med 3.72384120 -5.6507869 -6.4614948 -2.5490348
sci.space -2.49644737 -4.9534675 -5.6519521 -2.6338873
soc.religion.christian 2.67908535 1.1556453 6.6814628 -3.3196256
talk.politics.guns -1.10974910 -5.4778037 -2.9293610 -1.1856183
talk.politics.mideast 5.72453312 -2.6987484 0.3532723 -4.3997810
talk.politics.misc 1.18303190 -5.3802093 -3.4711458 0.4928157
talk.religion.misc 0.45942122 -0.1637296 3.4285962 -6.9764489
x13 x14 x15
comp.graphics -3.287707 0.35639470 -6.91246416
comp.os.ms-windows.misc -5.875776 -3.33726744 -6.89408072
comp.sys.ibm.pc.hardware -11.347861 -2.39922482 -1.47182928
comp.sys.mac.hardware -3.979639 -0.05450003 -0.06150157
comp.windows.x -1.753831 -2.01337073 -3.80837072
misc.forsale -19.850634 0.98566399 -7.73691044
rec.autos -5.288611 3.52664949 1.49950946
rec.motorcycles -4.760638 2.91862795 -3.38732795
rec.sport.baseball -13.446201 -6.45080449 -1.48063043
rec.sport.hockey -9.746940 -1.58992997 3.76260962
sci.crypt 1.106581 0.31910767 0.27873970
sci.electronics -7.679123 1.32204155 0.72514034
sci.med -2.625328 -3.55163264 -0.44419996
sci.space -7.155133 1.61298084 5.15959340
soc.religion.christian -3.812396 1.15236988 1.45284026
talk.politics.guns -6.649506 -2.65037735 7.37585392
talk.politics.mideast -2.596350 -2.90530800 2.50239590
talk.politics.misc -1.123491 1.02627441 4.08961917
talk.religion.misc -3.643682 -0.33512039 1.78958648
Residual Deviance: 66003.82
AIC: 66611.82
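One way to put a number on "strong signal" is to compare the fitted model against an intercept-only model and to look at in-sample accuracy. This is only a rough sketch on made-up data, not what was run above: the `toy` data frame, its three groups, and the two coordinates are hypothetical stand-ins for `x0`.

```r
library(nnet)

set.seed(1)
# Hypothetical stand-in for x0: 3 groups and 2 projected coordinates
n <- 300
group <- factor(rep(c("a", "b", "c"), each = n / 3))
centers <- matrix(c(0, 0,
                    3, 0,
                    0, 3), ncol = 2, byrow = TRUE)
toy <- data.frame(
  group = group,
  x1 = rnorm(n) + centers[as.integer(group), 1],
  x2 = rnorm(n) + centers[as.integer(group), 2]
)

# Model with the coordinates, and an intercept-only baseline
full <- multinom(group ~ x1 + x2, data = toy, trace = FALSE)
null <- multinom(group ~ 1, data = toy, trace = FALSE)

# A large drop in residual deviance relative to the baseline suggests
# that the coordinates carry information about the group
deviance_drop <- null$deviance - full$deviance

# In-sample predicted classes, accuracy, and confusion table
pred <- predict(full, newdata = toy, type = "class")
accuracy <- mean(pred == toy$group)
confusion <- table(actual = toy$group, predicted = pred)
```

On the real data the same comparison would use `x0` and the fifteen-term formula; a held-out split would give a less optimistic accuracy than the in-sample figure.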
Also, I did some experiments with clusters constructed based on each
newsgroup:
> m=aggregate(x0[,3:1002], by=list(group=x0$group), FUN=mean)
> plot(apply(m[,2:1001], MARGIN=1, FUN=function(v) {sqrt(sum((v-x0[9000,3:1002])^2))}))
> (x0$group[9000])
[1] soc.religion.christian
20 Levels: alt.atheism comp.graphics ... talk.religion.misc
> as.numeric(x0$group[9000])
[1] 16
Note that element 16, the document's own group (soc.religion.christian), has
the smallest distance. The next-smallest distances belong to related groups.
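The same check can be done without reading it off a plot, by ranking the centroid distances directly. A minimal sketch, again on made-up data: `toy`, its three groups, the five coordinates, and row 120 are all hypothetical stand-ins for `x0` and document 9000.

```r
set.seed(2)
# Hypothetical stand-in for x0: a group label plus 5 numeric coordinates
k <- 5
group <- factor(rep(c("g1", "g2", "g3"), each = 50))
shift <- c(g1 = 0, g2 = 4, g3 = 8)[as.character(group)]
toy <- data.frame(group = group,
                  matrix(rnorm(150 * k), ncol = k) + shift)

# Per-group centroids, one row per factor level
centroids <- aggregate(toy[, -1], by = list(group = toy$group), FUN = mean)

# Euclidean distance from one document (row 120) to every centroid
doc <- unlist(toy[120, -1])
d <- apply(centroids[, -1], 1, function(v) sqrt(sum((v - doc)^2)))
names(d) <- as.character(centroids$group)

nearest <- names(d)[which.min(d)]  # closest centroid
ranking <- names(sort(d))          # all centroids, nearest first
```

Here `ranking` gives the whole ordering, so the "second lowest" groups can be listed by name rather than picked out by eye.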