You're the search guy? Why did you marginalize my work?
On Sun, Nov 1, 2009 at 9:15 AM Robert Stojnic <rainma...@gmail.com> wrote:
>
> Hi Brian,
>
> I'm not sure this is a foundation-l type of discussion, but let me give a
> couple of comments.
>
> I took the liberty of re-running your sample query "hippie" using Google
> and the built-in search on simple.wp; here are the results I got for the
> top 10 hits:
>
> Google: [1]
> Hippie, Human Be-In, Woodstock Festival, South Park, Summer of Love,
> Lysergic acid diethylamide, Across the Universe (movie), Glam rock,
> Wikipedia:Simple talk/Archive 27, Morris Gleitzman
>
> simple.wikipedia.org: [2]
> Hippie, Flower Power, Counterculture, Human Be-In, Summer of Love,
> Woodstock Festival, San Francisco California, Glam Rock, Psychedelic
> pop, Neal Cassady
>
> LDA (your method, results from your e-mail):
> Acid rock, Aldeburgh Festival, Anne Murray, Carl Radle, Harry Nilsson,
> Jack Kerouac, Phil Spector, Plastic Ono Band, Rock and Roll, Salvador
> Allende, Smothers brothers, Stanley Kubrick
>
> Personally, I think the results provided by the internal search engine
> are the best, maybe even slightly better than Google's, and I'm not sure
> what kind of relatedness LDA captures.
>
> If we were to systematically benchmark these methods on en.wp I think
> Google would be better than internal search, mainly because it can
> extract information from pages that link to Wikipedia (which apparently
> doesn't work as well for simple.wp). But that is beside the point here.
>
> I think it is interesting that you found that certain classes of pages
> (e.g. featured articles) could be predicted from some statistical
> properties, although I'm not sure how big your false discovery rate is.
>
> In any case, if you do want to work on improving the search engine and
> classification of articles, here are some ideas I think are worth
> pursuing and problems worth solving:
>
> * Integrating trends into search results: if someone searches for "plane
> crash" a day after a plane crashes, the first hit should be that plane
> crash and not some random plane crash from 10 years ago. We can conclude
> this is the one they want because that page is likely to be getting a
> lot of page hits. So this boils down to: integrate page-hit data into
> search results in a way that is robust and hard to manipulate (e.g. by
> running a bot or refreshing a page a million times).
>
> * Morphological and context-dependent analysis: if a user enters a query
> like "douglas adams book", what are the concepts in this query? Should
> we group the query like [(douglas adams) (book)] or [(douglas) (adams
> book)]? Can we devise a general rule that will quickly and reliably
> separate the query into parts that are related to each other, and then
> use those to search through the article space to find the most relevant
> articles?
>
> * Technical challenges: can we efficiently index articles with their
> templates expanded? Can we make efficient category intersection
> (without subcategories)?
>
> * Extracting information: what kinds of information are in Wikipedia,
> and how do we properly extract and index them? What about chemical
> formulas, geographical locations, computer code, stuff in templates,
> tables, image captions, mathematical formulas...?
>
> * How can we improve on the language model? Can we have smarter stemming
> and word disambiguation (compare "shares" in "shares and bonds" vs.
> "John shares a cookie")? What about synonyms and acronyms? Can we
> improve on the language model "did you mean..." uses to correlate
> related words?
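The trend idea in the first bullet can be sketched as a simple re-ranker. Everything here is invented for illustration (function names, the two-day half-life, the 0.8 blend weight); it is not code from MediaWiki or Lucene, just one way to make a hit-count signal both recency-sensitive and hard to inflate:

```python
import math

def trend_score(hits_by_day, half_life_days=2.0):
    """Exponentially decayed page-hit score; hits_by_day[0] is today.

    log1p damps raw counts, so refreshing a page a million times
    gains far less than a million times the score.
    """
    return sum(
        math.log1p(hits) * 0.5 ** (days_ago / half_life_days)
        for days_ago, hits in enumerate(hits_by_day)
    )

def rank(results, hit_data, text_weight=0.8):
    """Blend a text-relevance score in [0, 1] with a normalized trend score.

    results: list of (title, text_score); hit_data: title -> daily hit counts.
    """
    trends = {title: trend_score(hits) for title, hits in hit_data.items()}
    top = max(trends.values()) or 1.0  # avoid dividing by zero when no hits
    blended = lambda t: text_weight * t[1] + (1 - text_weight) * trends[t[0]] / top
    return sorted(results, key=blended, reverse=True)
```

With two plane-crash articles, yesterday's crash (a huge recent hit spike) then outranks a ten-year-old crash even when the old article scores slightly higher on text relevance alone.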
>
> Hope this helps,
>
> Cheers, robert (a.k.a. "the search guy")
>
> [1] http://www.google.co.uk/search?q=hippie+site%3Asimple.wikipedia.org
> [2] http://simple.wikipedia.org/w/index.php?title=Special%3ASearch&search=Hippie&fulltext=Search
>
> Brian J Mingus wrote:
> > This paper (first reference) is the result of a class project I was
> > part of almost two years ago for CSCI 5417 Information Retrieval
> > Systems. It builds on a class project I did in CSCI 5832 Natural
> > Language Processing and which I presented at Wikimania '07. The project
> > was very late, as we didn't send the final paper in until the day
> > before New Year's. This technical report was never really announced
> > that I recall, so I thought it would be interesting to look briefly at
> > the results.
> >
> > The goal of this paper was to break articles down into surface features
> > and latent features and then use those to study the rating system being
> > used, predict article quality and rank results in a search engine. We
> > used the [[random forests]] classifier, which allowed us to analyze the
> > contribution of each feature to performance by looking directly at the
> > weights that were assigned. While the surface analysis was performed on
> > the whole English Wikipedia, the latent analysis was performed on the
> > Simple English Wikipedia (it is more expensive to compute).
> >
> > = Surface features =
> >
> > * Readability measures are the single best predictor of quality that I
> > have found, as defined by the Wikipedia Editorial Team (WET). The
> > [[Automated Readability Index]], [[Gunning Fog Index]] and
> > [[Flesch-Kincaid Grade Level]] were the strongest predictors, followed
> > by length of article HTML, number of paragraphs, [[Flesch Reading
> > Ease]], [[SMOG Grading]], number of internal links, [[Laesbarhedsindex
> > Readability Formula]], number of words and number of references.
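Two of the readability measures named above are simple closed-form statistics over character, word and sentence counts. A minimal sketch, using a crude vowel-run syllable counter that real implementations improve on:

```python
import re

def _counts(text):
    """Word list and a rough sentence count (terminal punctuation runs)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return words, sentences

def ari(text):
    """Automated Readability Index: 4.71*(chars/words) + 0.5*(words/sentences) - 21.43."""
    words, sentences = _counts(text)
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / sentences - 21.43

def syllables(word):
    """Crude syllable estimate: count runs of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    words, sentences = _counts(text)
    syl = sum(syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syl / len(words) - 15.59
```

Both formulas return an approximate US school grade level; text with longer words and longer sentences scores higher.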
> > Weakly predictive were number of "to be"s, number of sentences,
> > [[Coleman-Liau Index]], number of templates, PageRank, number of
> > external links and number of relative links. Not predictive (overall;
> > see the end of section 2 for the per-rating score breakdown): number
> > of h2s or h3s, number of conjunctions, number of images*, average word
> > length, number of h4s, number of prepositions, number of pronouns,
> > number of interlanguage links, average syllables per word, number of
> > nominalizations, article age (based on page id), proportion of
> > questions and average sentence length.
> >
> > :* Number of images was actually by far the single strongest predictor
> > of any class, but only for Featured articles. Because it was so good at
> > picking out featured articles, and somewhat good at picking out A and G
> > articles, the classifier was confused in so many cases that the overall
> > contribution of this feature to classification performance is zero.
> > :* Number of external links is strongly predictive of Featured
> > articles.
> > :* The B class is highly distinctive. It has a strong "signature," with
> > high predictive value assigned to many features. The Featured class is
> > also very distinctive. F, B and S (Stop/Stub) contain the most
> > information.
> > :* A is the least distinct class, not being very different from F or G.
> >
> > = Latent features =
> >
> > The algorithm used for latent analysis, which is an analysis of the
> > occurrence of words in every document with respect to the link
> > structure of the encyclopedia ("concepts"), is [[Latent Dirichlet
> > Allocation]]. This part of the analysis was done by CS PhD student
> > Praful Mangalath. An example of what can be done with the result of
> > this analysis is that you provide a word (a search query) such as
> > "hippie". You can then look at the weight of every article for the word
> > hippie. You can pick the article with the largest weight, and then look
> > at its link network.
> > You can pick out the articles that this article links to and/or which
> > link to this article that are also weighted strongly for the word
> > hippie, while also contributing maximally to this article's
> > "hippieness". We tried this query in our system (LDA), Google
> > (site:en.wikipedia.org hippie), and the Simple English Wikipedia's
> > Lucene search engine. The breakdown of articles occurring in the top
> > ten search results for this word for those engines is:
> >
> > * LDA only: [[Acid rock]], [[Aldeburgh Festival]], [[Anne Murray]],
> > [[Carl Radle]], [[Harry Nilsson]], [[Jack Kerouac]], [[Phil Spector]],
> > [[Plastic Ono Band]], [[Rock and Roll]], [[Salvador Allende]],
> > [[Smothers brothers]], [[Stanley Kubrick]]
> > * Google only: [[Glam Rock]], [[South Park]]
> > * Simple only: [[African Americans]], [[Charles Manson]],
> > [[Counterculture]], [[Drug use]], [[Flower Power]], [[Nuclear
> > weapons]], [[Phish]], [[Sexual liberation]], [[Summer of Love]]
> > * LDA & Google & Simple: [[Hippie]], [[Human Be-in]], [[Students for a
> > democratic society]], [[Woodstock festival]]
> > * LDA & Google: [[Psychedelic Pop]]
> > * Google & Simple: [[Lysergic acid diethylamide]], [[Summer of Love]]
> >
> > (See the paper for the articles produced for the keywords philosophy
> > and economics.)
> >
> > = Discussion / Conclusion =
> >
> > * The results of the latent analysis are totally up to your
> > perception. But what is interesting is that the LDA features predict
> > the WET ratings of quality just as well as the surface-level features.
> > Both feature sets (surface and latent) pull out almost all of the
> > information that the rating system bears.
> > * The rating system devised by the WET is not distinctive. You can best
> > tell the difference between, grouped together, Featured, A and Good
> > articles vs. B articles. Featured, A and Good articles are also quite
> > distinctive (Figure 1).
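The word-weighting procedure described above (score every article for a query word via its topic mixture) can be sketched with an off-the-shelf LDA implementation. The three toy documents, the two-topic setting, and the titles are stand-ins, not the paper's corpus or model, and this sketch omits the link-structure component of the actual analysis:

```python
# Score each article for a query word by combining the article's
# document-topic weights with each topic's probability of the word.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = {  # toy stand-in corpus
    "Hippie": "hippie counterculture peace love festival psychedelic",
    "Woodstock Festival": "festival music hippie rock woodstock concert",
    "Stock market": "shares bonds market finance trading stock",
}
vec = CountVectorizer()
X = vec.fit_transform(list(docs.values()))
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

doc_topic = lda.transform(X)                         # (n_docs, n_topics)
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
word_idx = vec.vocabulary_["hippie"]
scores = doc_topic @ topic_word[:, word_idx]         # P(word | doc) under the model
ranking = sorted(zip(docs, scores), key=lambda t: -t[1])
```

The article with the largest score plays the role of the top "hippie" hit; its link neighborhood would then be filtered by the same scores.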
> > Note that in this study we didn't look at Starts and Stubs, but in an
> > earlier paper we did.
> > :* This is interesting when compared to this recent entry on the
> > YouTube blog, "Five Stars Dominate Ratings":
> > http://youtube-global.blogspot.com/2009/09/five-stars-dominate-ratings.html
> > :* I think a sane, well-researched (with actual subjects) rating
> > system is well within the purview of the Usability Initiative. Helping
> > people find and create good content is what Wikipedia is all about.
> > Having a solid rating system allows you to reorganize the user
> > interface, the Wikipedia namespace, and the main namespace around good
> > content and bad content as needed. If you don't have a solid,
> > information-bearing rating system you don't know what good content
> > really is (really bad content is easy to spot).
> > :* My Wikimania talk was all about gathering data from people about
> > articles and using that to train machines to automatically pick out
> > good content. You ask people questions along dimensions that make
> > sense to people, and give the machine access to other surface features
> > (such as a statistical measure of readability, or length) and latent
> > features (such as can be derived from document word occurrence and
> > encyclopedia link structure). I referenced page 262 of Zen and the Art
> > of Motorcycle Maintenance to give an example of the kind of qualitative
> > features I would ask people about. It really depends on what features
> > end up bearing information, to be tested in "the lab". Each word is an
> > example dimension of quality: we have "*unity, vividness, authority,
> > economy, sensitivity, clarity, emphasis, flow, suspense, brilliance,
> > precision, proportion, depth and so on.*" You then use surface and
> > latent features to predict these values for all articles.
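The earlier point about reading feature contributions directly off a trained [[random forests]] classifier can be sketched like this. The three features and all data are synthetic stand-ins, constructed so that quality depends mostly on readability, just to show how per-feature weights are recovered:

```python
# Train a random forest on per-article features, then inspect
# feature_importances_ to see which feature carries the signal.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400
# Synthetic surface features for n articles
readability = rng.normal(10, 3, n)   # e.g. a grade-level score
paragraphs = rng.poisson(12, n)
images = rng.poisson(2, n)
X = np.column_stack([readability, paragraphs, images])
# Synthetic "quality" label driven mostly by readability, plus noise
y = (readability + 0.1 * paragraphs + rng.normal(0, 1, n) > 10).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = dict(zip(["readability", "paragraphs", "images"],
                       clf.feature_importances_))
```

On data like this the readability column dominates the importances, which is the same kind of readout the paper uses to rank its surface features.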
> > You can also say, when a person rates this article as high on the x
> > scale, they also mean that it has this much of these surface and these
> > latent features.
> >
> > = References =
> >
> > - DeHoust, C., Mangalath, P., Mingus, B. (2008). *Improving search in
> > Wikipedia through quality and concept discovery*. Technical Report.
> > http://grey.colorado.edu/mediawiki/sites/mingus/images/6/68/DeHoustMangalathMingus08.pdf
> > - Rassbach, L., Mingus, B., Blackford, T. (2007). *Exploring the
> > feasibility of automatically rating online article quality*. Technical
> > Report.
> > http://grey.colorado.edu/mediawiki/sites/mingus/images/d/d3/RassbachPincockMingus07.pdf
>
> _______________________________________________
> foundation-l mailing list
> foundatio...@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
_______________________________________________
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org
Guidelines: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives: https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/34X4J6XU6QWF4MCYCZKQGTAY74LR3MGN/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org