You're the search guy?
Why did you marginalize my work?
On Sun, Nov 1, 2009 at 9:15 AM Robert Stojnic wrote:
>
> Hi Brian,
>
> I'm not sure this is a foundation-l type of discussion, but let me give a
> couple of comments.
> I took the liberty of re-running your sample query "hippie" using google
> and built-in search on simple.wp, here are the results I got for top 10
> hits:
>
> Google: [1]
> Hippie, Human Be-In, Woodstock Festival, South Park, Summer of Love,
> Lysergic acid diethylamide, Across the Universe (movie), Glam rock,
> Wikipedia:Simple talk/Archive 27, Morris Gleitzman
>
> simple.wikipedia.org: [2]
> Hippie, Flower Power, Counterculture, Human Be-In, Summer of Love,
> Woodstock Festival, San Francisco California, Glam Rock, Psychedelic
> pop, Neal Cassady
>
> LDA (your method results from your e-mail):
> Acid rock, Aldeburgh Festival, Anne Murray, Carl Radle, Harry Nilsson,
> Jack Kerouac, Phil Spector, Plastic Ono Band, Rock and Roll, Salvador
> Allende, Smothers brothers, Stanley Kubrick
>
> Personally, I think the results provided by the internal search engine
> are the best, maybe even slightly better than google's, and I'm not sure
> what kind of relatedness LDA captures.
>
> If we were to systematically benchmark these methods on en.wp I think
> google would be better than internal search, mainly because it can
> extract information from pages that link to wikipedia (which apparently
> doesn't work as well for simple.wp). But that is beside the point here.
>
> I think it is interesting that you found that certain classes of pages
> (e.g. featured articles) could be predicted from some statistical
> properties, although I'm not sure how big your false discovery rate is.
>
> In any case, if you do want to work on improving the search engine and
> classification of articles, here are some ideas I think are worth
> pursuing and problems worth solving:
>
> * integrating trends into search results - if someone searches for "plane
> crash" a day after a plane crashes, the first hit should be that crash,
> not some random plane crash from 10 years ago - we can conclude this is
> the one they want because that page is likely to get a lot of page hits.
> So, this boils down to: integrate page-hit data into search results in a
> way that is robust and hard to manipulate (e.g. by running a bot or
> refreshing a page a million times)
>
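The trend idea could be sketched in a few lines of Python; the 24-hour bucketing, the median baseline, and the damping constant `alpha` are all my assumptions here, not anything specified in the mail:

```python
import math

def trend_boost(base_score, hourly_hits, alpha=0.5):
    """Boost a relevance score by recent page-hit volume.

    Log damping plus a median over the last 24 hourly buckets
    makes the boost hard to game: a bot refreshing one page a
    million times inflates a single bucket, which the median
    ignores, and even sustained inflation only moves log() by
    a few units.
    """
    buckets = sorted(hourly_hits[-24:])
    median = buckets[len(buckets) // 2] if buckets else 0
    return base_score * (1.0 + alpha * math.log1p(median))
```

A page with steady real traffic then outranks one whose hits come from a single spike, all else being equal.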
> * morphological and context-dependent analysis: if a user enters a query
> like "douglas adams book", what are the concepts in this query? Should we
> group it as [(douglas adams) (book)] or [(douglas) (adams book)]? Can we
> devise a general rule that quickly and reliably separates the query into
> parts that are related to each other, and then use those to search the
> article space for the most relevant articles?
>
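One standard way to attack that segmentation question is a small dynamic program over phrase statistics; in this sketch the `phrase_freq` table (which in practice might come from article titles and link anchors) is hypothetical:

```python
def segment(query, phrase_freq):
    """Split a query into the sequence of known phrases with the
    highest total frequency. phrase_freq maps phrases to counts;
    unknown queries fall back to word-by-word splitting."""
    words = query.split()
    n = len(words)
    best = {0: (0.0, [])}  # prefix length -> (score, phrase list)
    for end in range(1, n + 1):
        for start in range(end):
            phrase = " ".join(words[start:end])
            if start in best and phrase in phrase_freq:
                score = best[start][0] + phrase_freq[phrase]
                if end not in best or score > best[end][0]:
                    best[end] = (score, best[start][1] + [phrase])
    return best.get(n, (0.0, words))[1]
```

With a table where "douglas adams" is far more frequent as a phrase than "adams book", this grouping prefers [(douglas adams) (book)].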
> * technical challenges: can we efficiently index articles with their
> templates expanded, and can we do efficient category intersection
> (without subcategories)?
>
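For the category-intersection challenge, the core primitive is merging two sorted postings lists of page IDs; a minimal sketch:

```python
def intersect(a, b):
    """Intersect two sorted lists of page IDs (one per category)
    in O(len(a) + len(b)) with a two-pointer merge, without
    materialising subcategories."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out
```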
> * extracting information: what kinds of information are in Wikipedia, and
> how do we properly extract and index them? What about chemical formulas,
> geographical locations, computer code, stuff in templates, tables, image
> captions and mathematical formulas?
>
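As one concrete example of typed extraction, decimal coordinates could be pulled from a simplified {{coord}} template with a regular expression; real coordinate templates have many more forms (degrees/minutes/seconds, N/S/E/W flags) than this toy handles:

```python
import re

# Matches only the simple decimal form {{coord|lat|lon|...}}.
COORD = re.compile(
    r"\{\{coord\|(?P<lat>-?\d+(?:\.\d+)?)\|(?P<lon>-?\d+(?:\.\d+)?)")

def extract_coords(wikitext):
    """Pull decimal coordinate pairs out of wikitext so they can be
    indexed as typed geographic fields rather than plain words."""
    return [(float(m.group("lat")), float(m.group("lon")))
            for m in COORD.finditer(wikitext)]
```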
> * how can we improve on the language model? Can we have smarter stemming
> and word disambiguation (compare "shares" in "shares and bonds" vs "John
> shares a cookie")? What about synonyms and acronyms? Can we improve on
> the language model that "did you mean..." uses to correlate related words?
>
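A "did you mean..." suggester is often bootstrapped from edit distance plus corpus frequency; this sketch, in the spirit of Norvig's well-known spelling corrector and using a toy frequency table, suggests the most frequent in-vocabulary word within one edit:

```python
def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def did_you_mean(word, word_freq):
    """Suggest the most frequent known word within one edit of the query.
    word_freq would come from the article corpus; here it is a toy table."""
    if word in word_freq:
        return word
    candidates = edits1(word) & word_freq.keys()
    return max(candidates, key=word_freq.get) if candidates else word
```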
> Hope this helps,
>
> Cheers, robert (a.k.a "the search guy")
>
> [1] http://www.google.co.uk/search?q=hippie+site%3Asimple.wikipedia.org
> [2]
>
> http://simple.wikipedia.org/w/index.php?title=Special%3ASearch=Hippie=Search
>
>
> Brian J Mingus wrote:
> > This paper (first reference) is the result of a class project I was part of
> > almost two years ago for CSCI 5417 Information Retrieval Systems. It builds
> > on a class project I did in CSCI 5832 Natural Language Processing, which I
> > presented at Wikimania '07. The project was very late, as we didn't send
> > the final paper in until the day before New Year's. This technical report
> > was never really announced that I recall, so I thought it would be
> > interesting to look briefly at the results. The goal of this paper was to
> > break articles down into surface features and latent features and then use
> > those to study the rating system being used, predict article quality and
> > rank results in a search engine. We used the [[random forests]] classifier,
> > which allowed us to analyze the contribution of each feature to performance
> > by looking directly at the weights that were assigned. While the surface
> > analysis was performed on the whole English Wikipedia, the latent analysis
> > was performed on the Simple English Wikipedia (it is more expensive to
> > compute).
> >
> > = Surface features =
> > * Readability measures are the