Re: [Wikitech-ambassadors] Lua script that needs to look up a big table (phonetic guide automation)

bawolff Sun, 15 Mar 2020 19:16:21 -0700

So in this scenario, are all categories planned to be sorted via Jyutping?
If so, we could make a collation, in which case categories would
automatically be sorted that way, and you would just add the category to a
page in the normal way (by doing [[Category:Foo]] with no sortkey). The
downside would be that all categories would have to use Jyutping.


> mw.loadData format
It's just a normal Lua table format.
https://www.mediawiki.org/wiki/Module:ExtensionJson is an example. There
are of course limits on the maximum size of a page (I think it's 1 or 2 MB).
But based on the size of
https://raw.githubusercontent.com/MacroYau/PyJyutping/master/pyjyutping/data/jyutping_dictionary.json
you will probably be within the size limits.
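
For illustration, here is a minimal sketch of what such a data table and a
lookup over it might look like. The module name Module:Jyutping/data, the
function name sortKey, and the three sample entries are assumptions made up
for this example, not something from the thread. On a wiki the table would
normally sit in its own data module (ending in `return data`) and be read
with mw.loadData; it is inlined here so the snippet runs in plain Lua.

```lua
-- Hypothetical data table: one entry per character, value is its Jyutping.
-- On a wiki this would be the body of a data module that ends in
-- `return data`, read via mw.loadData('Module:Jyutping/data').
-- The module name and the entries are assumptions for this sketch.
local data = {
    ["香"] = "hoeng1",
    ["港"] = "gong2",
    ["粵"] = "jyut6",
}

-- Build a sort key by replacing each known character with its Jyutping.
-- Characters missing from the table are passed through unchanged.
local function sortKey(title)
    local parts = {}
    -- Match one UTF-8 encoded character at a time:
    -- a lead byte followed by any continuation bytes.
    for ch in string.gmatch(title, "[\1-\127\194-\244][\128-\191]*") do
        parts[#parts + 1] = data[ch] or ch
    end
    return table.concat(parts, " ")
end

print(sortKey("香港")) --> hoeng1 gong2
```

Exceptional pronunciations are not handled by a per-character table like
this one; they would still need a manually supplied sort key.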

> (b) it will be notably faster, because it runs directly on PHP, and not
> through the additional layer of Lua.

Personally, I am unconvinced the speed will be significantly different.

--
Brian

On Sun, Mar 15, 2020 at 11:19 PM Huji Lee <[email protected]> wrote:

> The additional information you provided was helpful. I still think the
> best approach is to have an extension that returns the Jyutping value for
> the article title. Let's say that extension introduces a new magic word
> called {{TITLEINJYUPTING}}. That way you can add
> {{DEFAULTSORT:{{TITLEINJYUPTING}}}} to the bottom of the pages; and for
> exceptional pronunciations you can use {{DEFAULTSORT:[special Jyutping
> pronunciation]}} instead. Alternatively, you could make it a parser
> function like {{DEFAULTSORT:{{#JYUPTING:{{PAGETITLE}}}}}} or something like
> that.
>
> If Jyutping is as predictable as you state then making an extension should
> be a good idea because (a) it can be used by non-WMF wikis too, without
> having to set up Scribunto, etc. and (b) it will be notably faster, because
> it runs directly on PHP, and not through the additional layer of Lua.
>
> On Sun, Mar 15, 2020 at 7:03 PM Deryck Chan <[email protected]> wrote:
>
>> bawolff - Would you be able to point me to an example of mw.loadData?
>>
>> Also, I've subscribed to https://phabricator.wikimedia.org/T46667 .
>>
>> Huji - I was inspired by Japanese Wikipedia's approach to sorting - they
>> have a {{DEFAULTSORT:[article name in hiragana]}} on all articles. Since
>> Cantonese pronunciation is even more predictable than Japanese, we could
>> potentially have a template that automatically adds {{DEFAULTSORT:[article
>> title in Jyutping]}} using a Lua lookup table of all common Chinese
>> characters. Exceptional pronunciations should then be coded individually.
>> The Pinyin implementation of this would be equivalent, though it would
>> depend on the zh.wp community agreeing on sorting things by Pinyin.
>>
>> In terms of storing the data, Wikidata is not a good answer. First up,
>> the Wikidata property creators community has rejected the notion of
>> creating separate properties for each common phonetic transcription system
>> of CJK languages, so the retrieval of Jyutping transcriptions from
>> Wikidata will be unnecessarily complicated. Second, Wikidata items refer to
>> concepts, not titles. We could theoretically ask the script to go to
>> Lexemes to fetch the phonetic transcription but that'll involve untangling
>> the multiple Lexemes that refer to the same Chinese character. In general,
>> the way Wikidata is structured makes it a bad fit for the problem at hand.
>>
>> Liangent's formulation of the problem is more general than the one I
>> described, because T46667 aims to allow multiple ways of sorting Chinese
>> characters within the same interface. That will be very welcome too.
>>
>> On Sun, 15 Mar 2020 at 19:55, bawolff <[email protected]> wrote:
>>
>>> Consider using
>>> https://www.mediawiki.org/wiki/Extension:Scribunto/Lua_reference_manual#mw.loadData
>>> , keeping in mind that Lua isn't really made with the use case of huge data
>>> tables in mind, so there might be limits you run into if your data is
>>> really big.
>>>
>>> --
>>> Bawolff
>>>
>>> On Sun, Mar 15, 2020 at 2:13 PM Deryck Chan <[email protected]>
>>> wrote:
>>>
>>>> Hello Ambassadors - This technical question may be relevant to multiple
>>>> (particularly CJK) language communities so I'm asking it here.
>>>>
>>>> What is the advice for writing a Lua script that needs to look up data
>>>> from a big table (~10k rows at first deployment, potentially increasing in
>>>> the future)? Does one hard-code the data into a Lua script, or is there a
>>>> recommended data structure for storing those?
>>>>
>>>> The design problem at hand is that the Cantonese Wikipedia wants to
>>>> re-sort articles by Jyutping rather than Unicode. This will probably
>>>> involve automating the generation of Jyutping phonetic guides by looking up
>>>> the Jyutping transcription of common Chinese characters using a Lua module.
>>>> Where do we store the data?
>>>>
>>>> If another wiki has done similar things, we'd be interested in sharing
>>>> the infrastructure.
>>>>
>>>> Deryck
>>>> On behalf of the Cantonese Wikipedia community
>>>>
>>>> _______________________________________________
>>>> Wikitech-ambassadors mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-ambassadors
>>>>
>

