> Well, I would expect "phonetic:" would bind with something like IPA, but
> the concept of keyword is interesting.

Finding good names for keywords is also an art. "phonetic:" came to mind because the algorithms used to index words by pronunciation are collectively called phonetic algorithms[1]. You could conceivably also map to IPA, but in general the algorithms are much less detailed than IPA, because they are trying to find a balance between inclusivity and exclusivity in grouping similar words (a lot drop non-initial vowels, for example), while IPA is usually much more specific.

Mapping IPA into such a system would be interesting. Say I heard someone talk about someone named /ɡədɑfi/; hopefully that would allow me to find Gaddafi (with his famously hard-to-spell name). Amusingly, the character folding on English Wikipedia maps ɡədɑfi to gadafi, which is a redirect to Gaddafi, so IPA sometimes works now! But it wouldn't work for George /kluni/.

We have gone far afield now, but there are Phabricator tickets for advanced search in general[2] and phonetic search specifically[3] if anyone wants to follow up there.

[1] https://en.wikipedia.org/wiki/Phonetic_algorithm
[2] https://phabricator.wikimedia.org/T174064
[3] https://phabricator.wikimedia.org/T174705

On Thu, Sep 21, 2017 at 6:50 AM, mathieu stumpf guntz <[email protected]> wrote:

> On 20/09/2017 at 03:40, Trey Jones wrote:
>
>> Anyway, would it be a big deal to show the transliterated results with
>> less weight in ranking?
>
> Doing any special weighting would be more difficult, but they would
> already be naturally ranked lower for not being exact matches. (You can see
> this at work if you compare the results for *resume, resumé,* and *résumé*
> on English Wikipedia, for example.)
>
> Interesting to know. Thank you.
>
>> Actually, add an option button in advanced search in any case, and just
>> limit discussion about should it be opt-in or opt-out.
>
> There are longer term plans for revamping advanced search capabilities, so
> if we want to go that route, it's doable, but it would definitely be on
> hold for a while. Options that have been mentioned include a special case
> keyword like "kana:オオカミ", or a more generic keyword like "phonetic:オオカミ"
> that was smart enough to know what to do with kana, but might do something
> different with other characters... but that's all at the vague ideation
> stage right now.
>
> Well, I would expect "phonetic:" would bind with something like IPA, but
> the concept of keyword is interesting.
>
> Thanks!
>
> Trey Jones
> Sr. Software Engineer, Search Platform
> Wikimedia Foundation
>
> On Tue, Sep 19, 2017 at 8:29 PM, mathieu stumpf guntz <[email protected]> wrote:
>
>> On 19/09/2017 at 23:47, Trey Jones wrote:
>>
>> We recently got a suggestion via Phabricator[1] to automatically map
>> between hiragana and katakana when searching on English Wikipedia and other
>> wiki projects. As an always-on feature, this isn't difficult to implement,
>> but major commercial search engines (Google.jp, Bing, Yahoo Japan,
>> DuckDuckGo, Goo) don't do that. They give different results when searching
>> for hiragana/katakana forms (for example, オオカミ/おおかみ "wolf"). They also give
>> different *numbers* of results, seeming to indicate that it's not just
>> re-ordering the same results (say, so that results in the same script are
>> ranked higher).[2] I want to know what they know that I don't!
>>
>> Does anyone have any thoughts on whether this would be useful (seems that
>> it would) and whether it would cause any problems (it must, or otherwise
>> all the other search engines would do it, right?).
>>
>> Well, maybe. Or not. Look how Duckduckgo continue to only give a
>> "country" option to filter *languages*. Now both might be complementary,
>> but personally I'm generally more interested with the later.
>> All the more when I'm using a language which have no country using it as
>> official language. :)
>>
>> Anyway, would it be a big deal to show the transliterated results with
>> less weight in ranking? Actually, add an option button in advanced search
>> in any case, and just limit discussion about should it be opt-in or opt-out.
>>
>> Any idea why it might be different between a Japanese-language wiki and a
>> non-Japanese-language wiki? We often are more aggressive in matching
>> between characters that are not native to a given language--for example,
>> accents on Latin characters are generally ignored on English-language
>> wikis. So it might make sense to merge hiragana and katakana on
>> English-language wikis but not Japanese-language wikis.
>>
>> Thanks very much for any suggestions or information!
>> -- Trey
>>
>> どういたしました。
>>
>> [1] https://phabricator.wikimedia.org/T176197
>> [2] Details of my tests at https://phabricator.wikimedia.org/T173650#3580309
>>
>> Trey Jones
>> Sr. Software Engineer, Search Platform
>> Wikimedia Foundation
>> _______________________________________________
>> Wikitech-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
