On Tue, Jul 12, 2016 at 12:26 PM, David Holland <dholland-t...@netbsd.org> wrote:
> On Tue, Jul 12, 2016 at 11:41:21AM +0530, Abhinav Upadhyay wrote:
> > >> But the downside is that technical keywords (e.g. kms, lfs, ffs)
> > >> are also stemmed down and stored (e.g. km, lf, ff) in the index.
> > >> So if you search for kms, you will see results for both kms and km.
> > >
> > > Interesting problem.
> > >
> > > I expect the set of documents that contain a word ("directories")
> > > and the set of documents containing its true stem ("directory") to
> > > overlap widely. I also expect the set of documents that contain a
> > > word ("kms") and an incorrect stem ("km") to scarcely overlap. Do
> > > the manual pages meet these expectations? If so, then maybe you can
> > > decide whether or not to keep a stem by looking at the document-set
> > > overlap?
> >
> > Yes, usually when the stem is incorrect, the overlap is not that
> > large. But the only way to figure out such cases is by manually
> > comparing the output of apropos, unless we have a pre-built list of
> > expected document-sets that we can compare against. :)
>
> You could build such a list from the current set of man pages, and
> refresh it once in a while, and that would probably work well enough.
That's one of the things I want to do. It would be nice to create a
labeled dataset, probably something like a set of queries and an
expected list of documents in the top 10 for each of them. It could
then be used as training data for tasks such as:

- evaluating the performance of various ranking algorithms,
- using machine learning to learn an optimal ranking algorithm,
- determining which keywords should be stemmed by comparing the
  overlap of the actual and expected results (rough sketch below).

> I'm wondering though if there's some characteristic of the document
> sets you can use to automatically reject wrong stemmings without
> having to precompute. What comes to mind though is some kind of
> diameter or breadth metric on the image of the document set on the
> crossreference graph. Or maybe something like the average
> crossreference pagerank of the document set, which if it's too high
> means you aren't retrieving useful information. But I guess these
> notions aren't much use because I'm sure we don't currently build the
> crossreference graph.

We haven't tried exploring this aspect. If we had a hand-labeled
dataset as mentioned above, we could probably compare the performance
of a PageRank-based ranking as well.

- Abhinav
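
P.S. To make the third item above a bit more concrete, here is a very
rough sketch in Python. None of this is real apropos/makemandb code:
the labeled dataset, the run_query() helper and the 0.5 threshold are
all made up for illustration. The idea is simply to compare, for each
labeled query, the pages actually returned against the expected ones,
and flag queries where the overlap is low, since those are the likely
victims of a bad stem such as kms -> km.

# Hypothetical labeled dataset: query -> expected top results.
labeled = {
    "kms":         {"drm.4", "drmkms.7"},
    "directories": {"ls.1", "mkdir.1", "rmdir.1"},
}

def overlap(expected, actual):
    """Fraction of the returned pages that were expected."""
    if not actual:
        return 0.0
    return len(expected & actual) / len(actual)

def flag_bad_stems(labeled, run_query, threshold=0.5):
    """Return queries whose results overlap poorly with the labels.

    run_query(q) stands in for however we obtain apropos' result set
    for q (e.g. by parsing its output)."""
    bad = []
    for query, expected in labeled.items():
        actual = run_query(query)
        if overlap(expected, actual) < threshold:
            bad.append(query)
    return bad

# Toy stand-in: pretend "kms" got stemmed to "km" and matched junk.
def fake_run_query(q):
    results = {
        "kms":         {"drm.4", "units.1", "grep.1", "awk.1"},
        "directories": {"ls.1", "mkdir.1", "rmdir.1", "ffs.7"},
    }
    return results.get(q, set())

print(flag_bad_stems(labeled, fake_run_query))   # -> ['kms']

The same overlap measure could of course also be computed directly
between the document set of a word and that of its stem, along the
lines you suggested earlier in the thread.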