Hey Ted,

I went back in time a bit and found a version which returned reasonable-looking results (at least, results comparable to those in the book). I ran 'svnversion .': the older (apparently working) version returned 1004406M, whereas the trunk version I am using is 1050223M. In the older version, the files in core/o.a.m.classifier.sgd are dated October 4th. It looks like some refactoring went on between October 4th and December 7th. For instance, the encoders were moved to the vectors package (from the vectorizer.encoders package). I spent a little time comparing diffs in the core sgd package, but not enough to discover what could be causing this behavior.

I hope this helps.

Chris

On Dec 20, 2010, at 4:25 PM, Ted Dunning wrote:

> Yeah... it looks like I really need to jump into this. These results are
> not right.
>
> On Mon, Dec 20, 2010 at 2:11 PM, Chris Schilling <[email protected]> wrote:
>
>> Hey Ted,
>>
>> Just FYI,
>>
>> I changed the Weight subclass of the ModelDissector to sort by true value
>> (rather than absolute value) and reran over the 20 newsgroups data. Here
>> are the results of the dissector function:
>>
>> body=rt          0.042  comp.sys.mac.hardware
>> body=computer    0.039  sci.electronics
>> body=seem        0.035  talk.religion.misc
>> body=mike        0.035  misc.forsale
>> body=windows     0.034  misc.forsale
>> body=just        0.032  sci.crypt
>> body=supports    0.032  talk.politics.mideast
>> body=x           0.032  talk.religion.misc
>> body=do          0.029  rec.motorcycles
>> body=university  0.028  comp.sys.mac.hardware
>> body=slagle      0.028  rec.sport.hockey
>>
>> I prefer the results from MIA :) Anyway, I know you are busy. If there is
>> anything I can do to help, let me know. Still getting familiar with the
>> code, but could help out with some guidance.
>>
>> Thanks a lot,
>> Chris
>>
>> On Dec 17, 2010, at 7:37 PM, Ted Dunning wrote:
>>
>>> Hard to say what changed just off hand. I was tweaking the SGD code
>>> pretty regularly as I learned from the results users were getting. I
>>> should look at the history to review what happened... some changes may
>>> not have been good.
>>>
>>> On Fri, Dec 17, 2010 at 5:28 PM, Chris Schilling <[email protected]> wrote:
>>>
>>>> Thanks for the answers Ted. Ill take a look inside the dissector. I was
>>>> just wondering because the results are quite a bit different from whats
>>>> in the book - Listing 15.9. Here are those results (where words have
>>>> weights > 1).
>>>>
>>>> body=space     2.1  sci.space
>>>> body=sale      1.9  misc.forsale
>>>> body=car       1.9  rec.autos
>>>> body=windows   1.8  comp.os.ms-windows.misc
>>>> body=mac       1.7  comp.sys.mac.hardware
>>>> body=bike      1.7  rec.motorcycles
>>>> body=apple     1.5  comp.sys.mac.hardware
>>>> body=gun       1.5  talk.politics.guns
>>>> body=baseball  1.5  rec.sport.baseball
>>>> body=graphics  1.5  comp.graphics
>>>>
>>>> I guess I mostly want to understand what changed. Again, Ill take a look
>>>> at the dissector, because the results of the training look pretty good.
>>>>
