BTW, the question set used in the paper can be found here, in a multilingual version with answers: https://github.com/AKSW/hawk/blob/master/resources/qald-4_multilingual_train_withanswers.xml (it does not include the keywords that the authors extracted for the Wikipedia search, though, as in the "Claudia Schiffer, tall" example).
On Wed, Aug 26, 2015 at 10:09 AM, Trey Jones <tjo...@wikimedia.org> wrote:

> So I got a copy of the paper (thanks, Phoebe!) and skimmed it quickly, and
> I'm not thrilled with the result.
>
> Their translation of questions into Wikipedia queries was sophisticated
> from a language processing point of view, but naive from a search point of
> view. "How tall is Claudia Schiffer?" became search terms (Claudia
> Schiffer, tall), though any sophisticated searcher should know that height
> is usually listed under "height", not "tall". (The query still works
> because it gets to the Claudia Schiffer wiki page.) They drop the word
> "produce" from a question about where beer is produced, but leave it in for
> a producer (but don't use "producer", which is the expected specific
> title). They also don't take advantage of any knowledge of Wikipedia, and
> don't search for the obvious "list of X" articles that often answer the
> questions with sortable tables. In one paragraph they mentioned the first
> page of 20 results, and in the next they said they only looked at 5. So,
> Wikipedia got short shrift, esp. as used by a moderately sophisticated user.
>
> They did also skew their scores by dropping two queries that were too
> complex and computing recall, precision, and F-score without them.
>
> They didn't seem to mention in this paper the manual effort of mapping
> infoboxes to whatever representation they used, and they never mentioned
> the computational power required by the human to map the question to the
> infobox components and the advantage this gives, again especially in
> comparison to the way they naively adapted the queries to Wikipedia search
> terms. A commensurate level of effort put into the wiki searches would give
> much, much better results.
>
> Still very interesting food for thought in terms of mapping infoboxes to
> properties and entities.
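To make the scoring point concrete, here is a minimal Python sketch of how leaving failed questions out of the evaluation inflates recall and F-score. The numbers are made up for illustration and are not the paper's data; the `Outcome` class, the `micro_scores` helper, and the choice of micro-averaging are my own assumptions, not the paper's evaluation procedure.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Per-question result (hypothetical structure, for illustration only)."""
    gold: int      # number of correct answers in the gold standard
    returned: int  # number of answers the system returned
    correct: int   # returned answers that are also in the gold standard


def micro_scores(outcomes):
    """Return micro-averaged (precision, recall, F1) over the given outcomes."""
    gold = sum(o.gold for o in outcomes)
    returned = sum(o.returned for o in outcomes)
    correct = sum(o.correct for o in outcomes)
    precision = correct / returned if returned else 0.0
    recall = correct / gold if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented data: 8 questions answered perfectly, 2 not answered at all.
answered = [Outcome(gold=1, returned=1, correct=1) for _ in range(8)]
dropped = [Outcome(gold=1, returned=0, correct=0) for _ in range(2)]

p_all, r_all, f_all = micro_scores(answered + dropped)
p_sub, r_sub, f_sub = micro_scores(answered)

print(f"all questions:  P={p_all:.2f} R={r_all:.2f} F1={f_all:.2f}")
print(f"answered only:  P={p_sub:.2f} R={r_sub:.2f} F1={f_sub:.2f}")
```

Precision is unaffected in this toy case because the dropped questions returned nothing, but recall (and with it F1) only looks perfect once the failures are excluded.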
>
> - Trey
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> On Tue, Aug 25, 2015 at 9:47 AM, Trey Jones <tjo...@wikimedia.org> wrote:
>
>> On Tue, Aug 25, 2015 at 7:58 AM, Oliver Keyes <oke...@wikimedia.org>
>> wrote:
>>
>>> So it's a comparison of two search systems, neither of which we use?
>>
>>
>> Well, sure... but they describe an interesting search paradigm that I
>> don't think we've even been considering (in the available paper). It's not
>> the type of query-by-example I'm used to seeing.
>>
>> They intercept requests for wiki pages and convert infoboxes into
>> structured query forms that allow some basic boolean syntax. It converts
>> these queries into SPARQL and hits DBpedia to get results. Sounds
>> reasonable.
>>
>> They do mention briefly in section 3.1 (last paragraph) that they
>> basically need a custom ("page-dependent") mapping from any given infobox
>> to appropriate internal representations for mapping to SPARQL. There are
>> some obvious machine learning approaches to try there. Since they don't
>> mention any machine learning, I assume they have done them manually, which
>> may or may not scale, depending on how many queries of the sort they are
>> interested in are covered by *n* manually mapped infobox types. Either way,
>> it's potentially brittle, since the Wikipedians tending the infoboxes won't
>> know about SWIPE.
>>
>> As for the comparison to Xser (which I'm not familiar with, though it's
>> described here: http://ceur-ws.org/Vol-1180/CLEF2014wn-QA-XuEt2014.pdf )
>> and plain keyword searches in Wikipedia, I'd really need to see the full
>> paper to comment properly, but I have some questions (which they may well
>> answer in the paper).
>>
>> Plain keyword searches in Wikipedia are a fine baseline, though I wonder
>> if they preprocessed the natural language queries, or just tossed the whole
>> question into search (which it is not meant to handle, though it often
>> works anyway).
>> And I don't know what counts as success: one of the first *n*
>> results contains the answer? How hard would a human have to look on a page
>> for the answer?
>>
>> It seems that the SWIPE system requires a human to translate the query
>> into the infobox template (and know which template to use!). So, for the
>> query "who has Tom Cruise been married to?" (from the Xser paper), it seems
>> the user has to convert "married to" into the "spouse(s)" field of the
>> person infobox, which is pushing the NLP processing into the human (of which
>> I am a fan, though it is not automatic).
>>
>> I'm not liking that they claim 96% recall "among all answered
>> questions". You don't get to ignore the ones you failed to answer when
>> calculating recall! 100% precision is nice.
>>
>> Xser seems more like the NLP system I would have first imagined: parse a
>> query, convert it into a structured format, and hit the RDF store for
>> answers. SWIPE seems to get the human to do the hard parts (parsing and
>> converting to a structured format, with the help of existing infoboxes), so
>> of course it does better than Xser.
>>
>> So what do we get out of this? If you haven't already thought of WDQS,
>> then you weren't paying attention! We could make things easier (for us, for
>> SWIPE, for anyone), if we could develop a standard way to map infobox
>> template fields to WDQS properties and contents to entities (someone
>> must've thought of this already).
>>
>> Parsing the content of those fields (if you know what they are supposed
>> to contain) is easier than parsing random queries or other chunks of text.
>> That info could be used to automatically or semi-automatically populate
>> WDQS, or to refer WDQS results back to relevant Wiki pages, or turn
>> templates into query forms as SWIPE does.
>>
>> Whether any of this gets onto our roadmap this century is a different
>> question, but there are some interesting things to think about here.
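As a rough illustration of the field-to-property mapping idea above, here is a small Python sketch that turns an infobox field on a given page into a SPARQL query string for DBpedia. It is not how SWiPE does it (the paper does not say). The `FIELD_TO_PROPERTY` table and the `build_query` helper are hypothetical names of mine, and the sketch only builds the query text without sending it anywhere.

```python
# Hypothetical hand-written mapping from infobox field names to DBpedia
# ontology properties. A real mapping would have to be built per template.
FIELD_TO_PROPERTY = {
    "spouse(s)": "dbo:spouse",
    "birth_place": "dbo:birthPlace",
}

PREFIXES = (
    "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
    "PREFIX dbr: <http://dbpedia.org/resource/>\n"
)


def build_query(page_title, field):
    """Build a SPARQL query asking for the value of one infobox field.

    Raises KeyError if the field has no mapping. The naive title handling
    only works for simple titles without punctuation that would need
    escaping in a prefixed name.
    """
    prop = FIELD_TO_PROPERTY[field]
    resource = "dbr:" + page_title.replace(" ", "_")
    return f"{PREFIXES}SELECT ?value WHERE {{ {resource} {prop} ?value . }}"


# The "who has Tom Cruise been married to?" example: the human picks the
# page and the "spouse(s)" field, and the mapping does the rest.
query = build_query("Tom Cruise", "spouse(s)")
print(query)
```

The point of the sketch is that the hard part is the mapping table, not the query generation: every unmapped field simply fails, which is the brittleness concern raised above.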
>>
>> So, can anyone get me a copy of the full paper?
>>
>> Thanks for the pointer, Tilman!
>>
>> - Trey
>>
>> Trey Jones
>> Software Engineer, Discovery
>> Wikimedia Foundation
>>
>>
>>
>>> On 25 August 2015 at 10:54, Tilman Bayer <tba...@wikimedia.org> wrote:
>>> > FYI just in case it's of interest and hasn't shown up on the team's
>>> radar yet:
>>> > http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7194368 -
>>> > paywalled, unfortunately.
>>> >
>>> > Quote from the abstract:
>>> >
>>> > "This paper discusses expressivity and accuracy of the By-Example
>>> > Structured (BESt) Query paradigm implemented on the SWiPE system
>>> > through the Wikipedia interface. We define an experimental setting
>>> > based on the natural language questions made available by the QALD-4
>>> > challenge, in which we compare SWiPE against Xser, a state-of-the-art
>>> > Question Answering system, and plain keyword search provided by the
>>> > Wikipedia Search Engine. The experiments show that SWiPE outperforms
>>> > the results provided by Wikipedia, and it also performs sensibly
>>> > better than Xser, obtaining an overall 85% of totally correct answers
>>> > vs. 68% of Xser."
>>> >
>>> > (For context, there's an earlier paper where they describe an earlier
>>> > version of that SWiPE - "Search Wikipedia by example" - project:
>>> > http://web.cs.ucla.edu/~zaniolo/papers/AtzoriZ12 )
>>> > --
>>> > Tilman Bayer
>>> > Senior Analyst
>>> > Wikimedia Foundation
>>> > IRC (Freenode): HaeB
>>> >
>>> > _______________________________________________
>>> > Wikimedia-search mailing list
>>> > Wikimedia-search@lists.wikimedia.org
>>> > https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
>>>
>>>
>>>
>>> --
>>> Oliver Keyes
>>> Count Logula
>>> Wikimedia Foundation
>>>
>>> _______________________________________________
>>> Wikimedia-search mailing list
>>> Wikimedia-search@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
>>>
>>
>>
>
> _______________________________________________
> Wikimedia-search mailing list
> Wikimedia-search@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
>
>

--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB