You're the search guy? Why did you marginalize my work?
On Sun, Nov 1, 2009 at 9:15 AM Robert Stojnic <rainma...@gmail.com> wrote:
>
> Hi Brian,
>
> I'm not sure this is a foundation-l type of discussion, but let me give a
> couple of comments.
>
> I took the liberty of re-running your sample query "hippie" using Google
> and the built-in search on simple.wp; here are the results I got for the
> top 10 hits:
>
> Google: [1]
> Hippie, Human Be-In, Woodstock Festival, South Park, Summer of Love,
> Lysergic acid diethylamide, Across the Universe (movie), Glam rock,
> Wikipedia:Simple talk/Archive 27, Morris Gleitzman
>
> simple.wikipedia.org: [2]
> Hippie, Flower Power, Counterculture, Human Be-In, Summer of Love,
> Woodstock Festival, San Francisco California, Glam Rock, Psychedelic
> pop, Neal Cassady
>
> LDA (your method, results from your e-mail):
> Acid rock, Aldeburgh Festival, Anne Murray, Carl Radle, Harry Nilsson,
> Jack Kerouac, Phil Spector, Plastic Ono Band, Rock and Roll, Salvador
> Allende, Smothers brothers, Stanley Kubrick
>
> Personally, I think the results provided by the internal search engine
> are the best, maybe even slightly better than Google's, and I'm not sure
> what kind of relatedness LDA captures.
>
> If we were to systematically benchmark these methods on en.wp I think
> Google would be better than internal search, mainly because it can
> extract information from pages that link to Wikipedia (which apparently
> doesn't work as well for simple.wp). But that is beside the point here.
>
> I think it is interesting that you found that certain classes of pages
> (e.g. featured articles) could be predicted from some statistical
> properties, although I'm not sure how big your false discovery rate is.
>
> In any case, if you do want to work on improving the search engine and
> classification of articles, here are some ideas I think are worth
> pursuing and problems worth solving:
>
> * Integrating trends into search results: if someone searches for "plane
> crash" a day after a plane crashes, the first hit should be that plane
> crash and not some random plane crash from 10 years ago. We can conclude
> this is the one they want because that page is likely to be getting a
> lot of page hits. So this boils down to: integrate page-hit data into
> search results in a way that is robust and hard to manipulate (e.g. by
> running a bot or refreshing a page a million times).
>
> * Morphological and context-dependent analysis: if a user enters a query
> like "douglas adams book", what are the concepts in this query? Should
> we group the query like [(douglas adams) (book)] or [(douglas) (adams
> book)]? Can we devise a general rule that will quickly and reliably
> separate the query into parts that are related to each other, and then
> use those to search through the article space to find the most relevant
> articles?
>
> * Technical challenges: can we efficiently index articles with their
> templates expanded? Can we make efficient category intersection
> (without subcategories)?
>
> * Extracting information: what kinds of information are in Wikipedia,
> and how do we properly extract and index them? What about chemical
> formulas, geographical locations, computer code, stuff in templates,
> tables, image captions, mathematical formulas...?
>
> * How can we improve on the language model? Can we have smarter stemming
> and word disambiguation (compare "shares" in "shares and bonds" vs.
> "John shares a cookie")? What about synonyms and acronyms? Can we
> improve on the language model "did you mean..." uses to correlate
> related words?
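The trend idea in the first bullet can be sketched as a simple re-ranker. Everything here is invented for illustration (function names, the two-day half-life, the 0.8 blend weight); it is not code from MediaWiki or Lucene, just one way to make a hit-count signal both recency-sensitive and hard to inflate:

```python
import math

def trend_score(hits_by_day, half_life_days=2.0):
    """Exponentially decayed page-hit score; hits_by_day[0] is today.

    log1p damps raw counts, so refreshing a page a million times
    gains far less than a million times the score.
    """
    return sum(
        math.log1p(hits) * 0.5 ** (days_ago / half_life_days)
        for days_ago, hits in enumerate(hits_by_day)
    )

def rank(results, hit_data, text_weight=0.8):
    """Blend a text-relevance score in [0, 1] with a normalized trend score.

    results: list of (title, text_score); hit_data: title -> daily hit counts.
    """
    trends = {title: trend_score(hits) for title, hits in hit_data.items()}
    top = max(trends.values()) or 1.0  # avoid dividing by zero when no hits
    blended = lambda t: text_weight * t[1] + (1 - text_weight) * trends[t[0]] / top
    return sorted(results, key=blended, reverse=True)
```

With two plane-crash articles, yesterday's crash (a huge recent hit spike) then outranks a ten-year-old crash even when the old article scores slightly higher on text relevance alone.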
>
> Hope this helps,
>
> Cheers, robert (a.k.a. "the search guy")
>
> [1] http://www.google.co.uk/search?q=hippie+site%3Asimple.wikipedia.org
> [2] http://simple.wikipedia.org/w/index.php?title=Special%3ASearch&search=Hippie&fulltext=Search
>
> Brian J Mingus wrote:
> > This paper (first reference) is the result of a class project I was
> > part of almost two years ago for CSCI 5417 Information Retrieval
> > Systems. It builds on a class project I did in CSCI 5832 Natural
> > Language Processing and which I presented at Wikimania '07. The project
> > was very late, as we didn't send the final paper in until the day
> > before New Year's. This technical report was never really announced
> > that I recall, so I thought it would be interesting to look briefly at
> > the results.
> >
> > The goal of this paper was to break articles down into surface features
> > and latent features and then use those to study the rating system being
> > used, predict article quality and rank results in a search engine. We
> > used the [[random forests]] classifier, which allowed us to analyze the
> > contribution of each feature to performance by looking directly at the
> > weights that were assigned. While the surface analysis was performed on
> > the whole English Wikipedia, the latent analysis was performed on the
> > Simple English Wikipedia (it is more expensive to compute).
> >
> > = Surface features =
> >
> > * Readability measures are the single best predictor of quality that I
> > have found, as defined by the Wikipedia Editorial Team (WET). The
> > [[Automated Readability Index]], [[Gunning Fog Index]] and
> > [[Flesch-Kincaid Grade Level]] were the strongest predictors, followed
> > by length of article HTML, number of paragraphs, [[Flesch Reading
> > Ease]], [[SMOG Grading]], number of internal links, [[Laesbarhedsindex
> > Readability Formula]], number of words and number of references.
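Two of the readability measures named above are simple closed-form statistics over character, word and sentence counts. A minimal sketch, using a crude vowel-run syllable counter that real implementations improve on:

```python
import re

def _counts(text):
    """Word list and a rough sentence count (terminal punctuation runs)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return words, sentences

def ari(text):
    """Automated Readability Index: 4.71*(chars/words) + 0.5*(words/sentences) - 21.43."""
    words, sentences = _counts(text)
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / sentences - 21.43

def syllables(word):
    """Crude syllable estimate: count runs of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    words, sentences = _counts(text)
    syl = sum(syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syl / len(words) - 15.59
```

Both formulas return an approximate US school grade level; text with longer words and longer sentences scores higher.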
> > Weakly predictive were number of "to be"s, number of sentences,
> > [[Coleman-Liau Index]], number of templates, PageRank, number of
> > external links and number of relative links. Not predictive (overall;
> > see the end of section 2 for the per-rating score breakdown): number
> > of h2s or h3s, number of conjunctions, number of images*, average word
> > length, number of h4s, number of prepositions, number of pronouns,
> > number of interlanguage links, average syllables per word, number of
> > nominalizations, article age (based on page id), proportion of
> > questions and average sentence length.
> >
> > :* Number of images was actually by far the single strongest predictor
> > of any class, but only for Featured articles. Because it was so good at
> > picking out featured articles, and somewhat good at picking out A and G
> > articles, the classifier was confused in so many cases that the overall
> > contribution of this feature to classification performance is zero.
> > :* Number of external links is strongly predictive of Featured
> > articles.
> > :* The B class is highly distinctive. It has a strong "signature," with
> > high predictive value assigned to many features. The Featured class is
> > also very distinctive. F, B and S (Stop/Stub) contain the most
> > information.
> > :* A is the least distinct class, not being very different from F or G.
> >
> > = Latent features =
> >
> > The algorithm used for latent analysis, which is an analysis of the
> > occurrence of words in every document with respect to the link
> > structure of the encyclopedia ("concepts"), is [[Latent Dirichlet
> > Allocation]]. This part of the analysis was done by CS PhD student
> > Praful Mangalath. An example of what can be done with the result of
> > this analysis is that you provide a word (a search query) such as
> > "hippie". You can then look at the weight of every article for the word
> > hippie. You can pick the article with the largest weight, and then look
> > at its link network.
> > You can pick out the articles that this article links to and/or which
> > link to this article that are also weighted strongly for the word
> > hippie, while also contributing maximally to this article's
> > "hippieness". We tried this query in our system (LDA), Google
> > (site:en.wikipedia.org hippie), and the Simple English Wikipedia's
> > Lucene search engine. The breakdown of articles occurring in the top
> > ten search results for this word for those engines is:
> >
> > * LDA only: [[Acid rock]], [[Aldeburgh Festival]], [[Anne Murray]],
> > [[Carl Radle]], [[Harry Nilsson]], [[Jack Kerouac]], [[Phil Spector]],
> > [[Plastic Ono Band]], [[Rock and Roll]], [[Salvador Allende]],
> > [[Smothers brothers]], [[Stanley Kubrick]]
> > * Google only: [[Glam Rock]], [[South Park]]
> > * Simple only: [[African Americans]], [[Charles Manson]],
> > [[Counterculture]], [[Drug use]], [[Flower Power]], [[Nuclear
> > weapons]], [[Phish]], [[Sexual liberation]], [[Summer of Love]]
> > * LDA & Google & Simple: [[Hippie]], [[Human Be-in]], [[Students for a
> > democratic society]], [[Woodstock festival]]
> > * LDA & Google: [[Psychedelic Pop]]
> > * Google & Simple: [[Lysergic acid diethylamide]], [[Summer of Love]]
> >
> > (See the paper for the articles produced for the keywords philosophy
> > and economics.)
> >
> > = Discussion / Conclusion =
> >
> > * The results of the latent analysis are totally up to your
> > perception. But what is interesting is that the LDA features predict
> > the WET ratings of quality just as well as the surface-level features.
> > Both feature sets (surface and latent) pull out almost all of the
> > information that the rating system bears.
> > * The rating system devised by the WET is not distinctive. You can best
> > tell the difference between, grouped together, Featured, A and Good
> > articles vs. B articles. Featured, A and Good articles are also quite
> > distinctive (Figure 1).
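The word-weighting procedure described above (score every article for a query word via its topic mixture) can be sketched with an off-the-shelf LDA implementation. The three toy documents, the two-topic setting, and the titles are stand-ins, not the paper's corpus or model, and this sketch omits the link-structure component of the actual analysis:

```python
# Score each article for a query word by combining the article's
# document-topic weights with each topic's probability of the word.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = {  # toy stand-in corpus
    "Hippie": "hippie counterculture peace love festival psychedelic",
    "Woodstock Festival": "festival music hippie rock woodstock concert",
    "Stock market": "shares bonds market finance trading stock",
}
vec = CountVectorizer()
X = vec.fit_transform(list(docs.values()))
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

doc_topic = lda.transform(X)                         # (n_docs, n_topics)
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
word_idx = vec.vocabulary_["hippie"]
scores = doc_topic @ topic_word[:, word_idx]         # P(word | doc) under the model
ranking = sorted(zip(docs, scores), key=lambda t: -t[1])
```

The article with the largest score plays the role of the top "hippie" hit; its link neighborhood would then be filtered by the same scores.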
> > Note that in this study we didn't look at Starts and Stubs, but in an
> > earlier paper we did.
> > :* This is interesting when compared to this recent entry on the
> > YouTube blog, "Five Stars Dominate Ratings":
> > http://youtube-global.blogspot.com/2009/09/five-stars-dominate-ratings.html
> > :* I think a sane, well-researched (with actual subjects) rating
> > system is well within the purview of the Usability Initiative. Helping
> > people find and create good content is what Wikipedia is all about.
> > Having a solid rating system allows you to reorganize the user
> > interface, the Wikipedia namespace, and the main namespace around good
> > content and bad content as needed. If you don't have a solid,
> > information-bearing rating system you don't know what good content
> > really is (really bad content is easy to spot).
> > :* My Wikimania talk was all about gathering data from people about
> > articles and using that to train machines to automatically pick out
> > good content. You ask people questions along dimensions that make
> > sense to people, and give the machine access to other surface features
> > (such as a statistical measure of readability, or length) and latent
> > features (such as can be derived from document word occurrence and
> > encyclopedia link structure). I referenced page 262 of Zen and the Art
> > of Motorcycle Maintenance to give an example of the kind of qualitative
> > features I would ask people about. It really depends on what features
> > end up bearing information, to be tested in "the lab". Each word is an
> > example dimension of quality: we have "*unity, vividness, authority,
> > economy, sensitivity, clarity, emphasis, flow, suspense, brilliance,
> > precision, proportion, depth and so on.*" You then use surface and
> > latent features to predict these values for all articles.
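The earlier point about reading feature contributions directly off a trained [[random forests]] classifier can be sketched like this. The three features and all data are synthetic stand-ins, constructed so that quality depends mostly on readability, just to show how per-feature weights are recovered:

```python
# Train a random forest on per-article features, then inspect
# feature_importances_ to see which feature carries the signal.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400
# Synthetic surface features for n articles
readability = rng.normal(10, 3, n)   # e.g. a grade-level score
paragraphs = rng.poisson(12, n)
images = rng.poisson(2, n)
X = np.column_stack([readability, paragraphs, images])
# Synthetic "quality" label driven mostly by readability, plus noise
y = (readability + 0.1 * paragraphs + rng.normal(0, 1, n) > 10).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = dict(zip(["readability", "paragraphs", "images"],
                       clf.feature_importances_))
```

On data like this the readability column dominates the importances, which is the same kind of readout the paper uses to rank its surface features.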
> > You can also say, when a person rates this article as high on the x
> > scale, they also mean that it has this much of these surface and these
> > latent features.
> >
> > = References =
> >
> > - DeHoust, C., Mangalath, P., Mingus, B. (2008). *Improving search in
> > Wikipedia through quality and concept discovery*. Technical Report.
> > http://grey.colorado.edu/mediawiki/sites/mingus/images/6/68/DeHoustMangalathMingus08.pdf
> > - Rassbach, L., Mingus, B., Blackford, T. (2007). *Exploring the
> > feasibility of automatically rating online article quality*. Technical
> > Report.
> > http://grey.colorado.edu/mediawiki/sites/mingus/images/d/d3/RassbachPincockMingus07.pdf
>
> _______________________________________________
> foundation-l mailing list
> foundatio...@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
_______________________________________________
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org
Guidelines: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives: https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/34X4J6XU6QWF4MCYCZKQGTAY74LR3MGN/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org