Re: [Wikitech-l] Mapping Hiragana and Katakana

Trey Jones Thu, 21 Sep 2017 06:52:31 -0700

>
> Well, I would expect "phonetic:" would bind with something like IPA, but
> the concept of keyword is interesting.



Finding good names for keywords is also an art. "phonetic:" came to mind
because the algorithms used to index words by pronunciation are
collectively called phonetic algorithms[1]. You could conceivably also map
to IPA, but in general the algorithms are much less detailed than IPA,
because they are trying to find a balance between inclusivity and
exclusivity in grouping similar words (a lot drop non-initial vowels, for
example), while IPA is usually much more specific.
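
To make the "less detailed than IPA" point concrete, here is a minimal sketch of one well-known phonetic algorithm, American Soundex. It is only an illustration of the general idea (keep the first letter, drop vowels, collapse similar consonants into one digit); nothing here is meant to suggest which algorithm a "phonetic:" keyword would actually use, and the function name soundex is my own.

```python
# Illustrative sketch of American Soundex; not a proposal for what a
# "phonetic:" keyword would actually use.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return a 4-character Soundex key for an ASCII word."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()          # the first letter is kept as-is
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")      # vowels, h, w and y have no code
        if code and code != prev:
            key += code
        # h and w do not separate consonants with the same code; vowels do
        if ch not in "hw":
            prev = code
    return (key + "000")[:4]          # pad or truncate to 4 characters


print(soundex("Gaddafi"), soundex("Gadafi"))   # G310 G310
print(soundex("Clooney"), soundex("Kluni"))    # C450 K450
```

Note that the non-initial vowels vanish entirely, so the two spellings of Gaddafi collapse to one key, while Clooney and Kluni differ only because Soundex preserves the first letter.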

Mapping IPA into such a system would be interesting. Say I heard someone
talk about someone named /ɡədɑfi/—hopefully that would allow me to find
Gaddafi (with his famously hard-to-spell name). Amusingly, the character
folding on English Wikipedia maps ɡədɑfi to gadafi, which is a redirect to
Gaddafi—so IPA sometimes works now! But it wouldn't work for George /kluni/.
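
A rough sketch of why folding helps in one case and not the other. The IPA_FOLD table below is hypothetical: it is reconstructed only from the single ɡədɑfi to gadafi example above and is not the actual folding configuration used on English Wikipedia. The fold function is likewise just an illustration.

```python
import unicodedata

# Hypothetical folding table, inferred from the one example in this thread.
# The real character folding on English Wikipedia is far more extensive.
IPA_FOLD = {
    "\u0261": "g",  # LATIN SMALL LETTER SCRIPT G
    "\u0259": "a",  # LATIN SMALL LETTER SCHWA
    "\u0251": "a",  # LATIN SMALL LETTER ALPHA
}


def fold(text: str) -> str:
    """Strip combining accents, then apply the illustrative IPA table."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(IPA_FOLD.get(c, c) for c in stripped).lower()


print(fold("\u0261\u0259d\u0251fi"))  # gadafi
print(fold("kluni"))                  # kluni
```

Folding works character by character, so it can only get from IPA to a spelling when the two happen to line up letter for letter. It has no way to get from "kluni" to "clooney".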

We have gone far afield now, but there are Phabricator tickets for advanced
search in general[2] and phonetic search specifically[3] if anyone wants to
follow up there.

[1] https://en.wikipedia.org/wiki/Phonetic_algorithm
[2] https://phabricator.wikimedia.org/T174064
[3] https://phabricator.wikimedia.org/T174705


On Thu, Sep 21, 2017 at 6:50 AM, mathieu stumpf guntz <
[email protected]> wrote:

>
>
> On 20/09/2017 at 03:40, Trey Jones wrote:
>
> Anyway, would it be a big deal to show the transliterated results with
>> less weight in ranking?
>
>
> Doing any special weighting would be more difficult, but they would
> already be naturally ranked lower for not being exact matches. (You can see
> this at work if you compare the results for *resume, resumé,* and *résumé*
> on English Wikipedia, for example.)
>
> Interesting to know. Thank you.
>
>
> Actually, add an option button in advanced search in any case, and just
>> limit discussion about should it be opt-in or opt-out.
>
>
> There are longer term plans for revamping advanced search capabilities, so
> if we want to go that route, it's doable, but it would definitely be on
> hold for a while. Options that have been mentioned include a special case
> keyword like "kana:オオカミ", or a more generic keyword like "phonetic:オオカミ"
> that was smart enough to know what to do with kana, but might do something
> different with other characters... but that's all at the vague ideation
> stage right now.
>
> Well, I would expect "phonetic:" would bind with something like IPA, but
> the concept of keyword is interesting.
>
>
> Thanks!
>
>
> Trey Jones
> Sr. Software Engineer, Search Platform
> Wikimedia Foundation
>
> On Tue, Sep 19, 2017 at 8:29 PM, mathieu stumpf guntz <
> [email protected]> wrote:
>
>>
>>
>> On 19/09/2017 at 23:47, Trey Jones wrote:
>>
>> We recently got a suggestion via Phabricator[1] to automatically map
>> between hiragana and katakana when searching on English Wikipedia and other
>> wiki projects. As an always-on feature, this isn't difficult to implement,
>> but major commercial search engines (Google.jp, Bing, Yahoo Japan,
>> DuckDuckGo, Goo) don't do that. They give different results when searching
>> for hiragana/katakana forms (for example, オオカミ/おおかみ "wolf"). They also give
>> different *numbers* of results, seeming to indicate that it's not just
>> re-ordering the same results (say, so that results in the same script are
>> ranked higher).[2] I want to know what they know that I don't!
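
For reference, a minimal sketch of the kind of always-on mapping described above, assuming it is done by shifting code points: in Unicode, the katakana in the range U+30A1 to U+30F6 sit exactly 0x60 above their hiragana counterparts. The function name is hypothetical, and this is not the search analysis chain itself, just the underlying character mapping.

```python
# Sketch only: maps katakana to hiragana so both forms index identically.
KATAKANA_START = 0x30A1  # KATAKANA LETTER SMALL A
KATAKANA_END = 0x30F6    # KATAKANA LETTER SMALL KE
OFFSET = 0x60            # distance between the katakana and hiragana blocks


def katakana_to_hiragana(text: str) -> str:
    """Replace each katakana letter with its hiragana counterpart."""
    return "".join(
        chr(ord(c) - OFFSET) if KATAKANA_START <= ord(c) <= KATAKANA_END else c
        for c in text
    )


print(katakana_to_hiragana("オオカミ"))  # おおかみ
```

Characters outside that range, such as the prolonged sound mark ー, kanji, and Latin letters, pass through unchanged.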
>>
>> Does anyone have any thoughts on whether this would be useful (seems that
>> it would) and whether it would cause any problems (it must, or otherwise
>> all the other search engines would do it, right?).
>>
>> Well, maybe. Or not. Look how DuckDuckGo continues to only give a
>> "country" option to filter *languages*. Now both might be complementary,
>> but personally I'm generally more interested in the latter. All the more
>> when I'm using a language which has no country using it as an official
>> language. :)
>>
>> Anyway, would it be a big deal to show the transliterated results with
>> less weight in ranking? Actually, add an option button in advanced search
>> in any case, and just limit discussion about should it be opt-in or
>> opt-out.
>>
>> Any idea why it might be different between a Japanese-language wiki and a
>> non-Japanese-language wiki? We often are more aggressive in matching
>> between characters that are not native to a given language--for example,
>> accents on Latin characters are generally ignored on English-language
>> wikis. So it might make sense to merge hiragana and katakana on
>> English-language wikis but not Japanese-language wikis.
>>
>> Thanks very much for any suggestions or information!
>> —Trey
>>
>>
>> どういたしました。
>>
>>
>>
>> [1] https://phabricator.wikimedia.org/T176197
>> [2] Details of my tests at https://phabricator.wikimedia.org/T173650#3580309
>>
>> Trey Jones
>> Sr. Software Engineer, Search Platform
>> Wikimedia Foundation
>> _______________________________________________
>> Wikitech-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
>>
>>
>
>
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
