Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-25 Thread Daniel Kinzler
On 25.11.2016 at 12:16, David Cuenca Tudela wrote:
>> If we want to avoid this complexity, we could just go by prefix. So if the
>> language is "de", variants like "de-CH" or "de-DE_old" would be considered ok.
>> Ordering these alphabetically would put the "main" code (with no suffix) first.
>> May be ok for a start.
> 
> I find this issue potentially controversial, and I think that the community at
> large should be involved in this matter to avoid future dissatisfaction and to
> promote involvement in the decision-making.

We should absolutely discuss this with Wiktionarians. My suggestion was intended
as a baseline implementation. Details about the restrictions on which variants
are allowed on a Lexeme, or in what order they are shown, can be changed later
without breaking anything.

> In my opinion it would be more appropriate to use standardized language codes,
> and then specify the dialect with an item, as that provides greater flexibility.
> However, as mentioned before, I would prefer that this topic in particular be
> discussed with Wiktionarians.

Using Items to represent dialects is going to be tricky. We need ISO language
codes for use in HTML and RDF. We can somehow map between Items and ISO codes,
but that's going to be messy, especially when that mapping changes.

So it seems like we need to further discuss how to represent a Lexeme's language
and each lemma's variant. My current thinking is to represent the language as an
Item reference, and the variant as an ISO code. But you are suggesting the
opposite.

I can see why one would want items for dialects, but I currently have no good
idea for making this work with the existing technology. Further investigation is
needed.

I have filed a Phabricator task for investigating this. I suggest we take the
discussion about how to represent languages/variants/dialects/etc. there:

https://phabricator.wikimedia.org/T151626

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.



Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-25 Thread Thiemo Mättig
Hi all!

I tweaked my part of the decision matrix a little bit:

https://docs.google.com/spreadsheets/d/1PtGkt6E8EadCoNvZLClwUNhCxC-cjTy5TY8seFVGZMY/edit?ts=5834219d#gid=868938568

The arguments in my matrix are basically a collection of "the worst
things that can happen". I like this approach. ;-)

The arguments I consider most important (they should have a high
number in the last column) are:

1. Changing Term to TermList later is almost impossible. This alone
could be set to a "-100" and make all the other arguments obsolete.

2. I'm very much concerned about any UI consuming Lemmas becoming very
complicated, both from the user's and the developer's perspective. When a
Lexeme allows any number of Lemmas, should this include zero Lemmas? Which
language codes will be allowed? Do we want to enforce at least one
Lemma? Do we need to validate the used language codes, or are
post-edit checks enough? Do we even have standardized language codes
for all variants? Is it possible to have multiple Lemmas with the same
language code? Which Lemma is the primary one then? How do we deprecate
one?

The list goes on.

All this sounds like we are going to reimplement the majority of the
statements UI, just without Ranks, Qualifiers and References.

Third-party devs will also have to deal with all these problems (also
see Denny's comments).

I suggest using a TermList anyway, but starting with a very hard
limitation: it *must* contain exactly one element, and its language
code *must* be exactly the same as the language code of the Lexeme. We
can lift these limitations later when needed, step by step.
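
For illustration, a minimal sketch of what that restriction could look like;
the Term shape and the validate_lemmas() helper are hypothetical, not existing
Wikibase code:

    from dataclasses import dataclass

    @dataclass
    class Term:
        language_code: str  # e.g. "de"
        text: str           # e.g. "Haus"

    def validate_lemmas(lemmas: list[Term], lexeme_language_code: str) -> None:
        # Initial hard limitation: exactly one lemma, and its language code
        # must be identical to the Lexeme's language code.
        if len(lemmas) != 1:
            raise ValueError("A Lexeme must have exactly one lemma for now.")
        if lemmas[0].language_code != lexeme_language_code:
            raise ValueError("The lemma's language code must match the Lexeme's.")

    validate_lemmas([Term("de", "Haus")], "de")       # accepted
    # validate_lemmas([Term("de-CH", "Haus")], "de")  # would be rejected for now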

Best
Thiemo



Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-25 Thread David Cuenca Tudela
> If we want to avoid this complexity, we could just go by prefix. So if the
> language is "de", variants like "de-CH" or "de-DE_old" would be considered ok.
> Ordering these alphabetically would put the "main" code (with no suffix) first.
> May be ok for a start.

I find this issue potentially controversial, and I think that the community
at large should be involved in this matter to avoid future dissatisfaction
and to promote involvement in the decision-making.

For languages there are regulatory bodies that assign codes, but for
varieties that is not the case, or at least not entirely. Even under en-GB
there are many varieties and dialects:
https://en.wikipedia.org/wiki/List_of_dialects_of_the_English_language#United_Kingdom

In my opinion it would be more appropriate to use standardized language
codes, and then specify the dialect with an item, as that provides greater
flexibility. However, as mentioned before, I would prefer that this topic in
particular be discussed with Wiktionarians.


Thanks for moving this forward!

David



On Fri, Nov 25, 2016 at 11:45 AM, Daniel Kinzler <daniel.kinz...@wikimedia.de> wrote:

> Thank you Denny for having an open mind! And sorry for being a nuisance ;)
>
> I think it's very important to have controversial but constructive discussions
> about these things. Data models are very hard to change even slightly once
> people have started to create and use the data. We need to try hard to get it
> as right as possible off the bat.
>
> Some remarks inline below.
>
> On 25.11.2016 at 03:32, Denny Vrandečić wrote:
> > There is one thing that worries me about the multi-lemma approach, and that
> > is the mention of a discussion about ordering. If possible, I would suggest
> > not to have ordering in every single Lexeme or even Form, but rather to use
> > the following solution:
> >
> > If I understand it correctly, we won't let every Lexeme have every arbitrary
> > language anyway, right? Instead we will, for each language that has variants,
> > have somewhere in the configuration an explicit list of these variants, i.e.
> > say, for English it will be US, British, etc., and for Portuguese it will be
> > Brazilian and Portuguese, etc.
>
> That approach is similar to what we are now doing for sorting Statement groups
> on Items. There is a global ordering of properties defined on a wiki page. So
> the community can still fight over it, but only in one place :) We can
> re-order based on user preference using a Gadget.
>
> For the multi-variant lemmas, we need to declare the Lexeme's language
> separately, in addition to the language code associated with each lemma
> variant. It seems like the language will probably be represented as a
> reference to a Wikidata Item (that is, a Q-Id). That Item can be associated
> with an (ordered) list of matching language codes, via Statements on the Item,
> or via configuration (or, like we do for unit conversion, configuration
> generated from Statements on Items).
>
> If we want to avoid this complexity, we could just go by prefix. So if the
> language is "de", variants like "de-CH" or "de-DE_old" would be considered ok.
> Ordering these alphabetically would put the "main" code (with no suffix)
> first. May be ok for a start.
>
> I'm not sure yet at what level we want to enforce the restriction on language
> codes. We can do it just before saving new data (the "validation" step), or we
> could treat it as a community-enforced soft constraint. I'm tending towards
> the former, though.
>
> > Given that, we can in that very same place also define their ordering and
> > their fallbacks.
>
> Well, all lemmas would fall back on each other; the question is just which
> ones should be preferred. Simple heuristic: prefer the shortest language code.
> Or go by what MediaWiki does for the UI (which is what we do for Item labels).
>
> > The upside is that it seems that this very same solution could also be used
> > for languages with different scripts, like Serbian, Kazakh, and Uzbek
> > (although it would not cover the problems with Chinese, but that wasn't
> > solved previously either - so the situation is strictly better). (It doesn't
> > really solve all problems - there is a reason why ISO treats language
> > variants and scripts independently - but it improves on the vast majority of
> > the problematic cases).
>
> Yes, it's not the only decision we have to make in this regard, but the most
> fundamental one, I think.
>
> One consequence of this is that Forms should probably also allow multiple
> representations/spellings. This is for consistency with the lemma, for code
> re-use, and for compatibility with Lemon.
>
> > So, given that we drop any local ordering in the UI and API, I think that
> > staying close to Lemon and choosing a TermList seems currently like the most
> > promising approach to me, and I changed my mind.
>
> Knowing that you won't do that without a good reason, I thank you for the
> 

Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-25 Thread Daniel Kinzler
Thank you Denny for having an open mind! And sorry for being a nuisance ;)

I think it's very important to have controversial but constructive discussions
about these things. Data models are very hard to change even slightly once
people have started to create and use the data. We need to try hard to get it as
right as possible off the bat.

Some remarks inline below.

On 25.11.2016 at 03:32, Denny Vrandečić wrote:
> There is one thing that worries me about the multi-lemma approach, and that
> is the mention of a discussion about ordering. If possible, I would suggest
> not to have ordering in every single Lexeme or even Form, but rather to use
> the following solution:
> 
> If I understand it correctly, we won't let every Lexeme have every arbitrary
> language anyway, right? Instead we will, for each language that has variants,
> have somewhere in the configuration an explicit list of these variants, i.e.
> say, for English it will be US, British, etc., and for Portuguese it will be
> Brazilian and Portuguese, etc.

That approach is similar to what we are now doing for sorting Statement groups
on Items. There is a global ordering of properties defined on a wiki page. So
the community can still fight over it, but only in one place :) We can re-order
based on user preference using a Gadget.

For the multi-variant lemmas, we need to declare the Lexeme's language
separately, in addition to the language code associated with each lemma variant.
It seems like the language will probably be represented as a reference to a
Wikidata Item (that is, a Q-Id). That Item can be associated with an (ordered)
list of matching language codes, via Statements on the Item, or via
configuration (or, like we do for unit conversion, configuration generated from
Statements on Items).
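
Purely as an illustration of that "explicit list per language" idea, the
configuration could boil down to something like the following Python sketch;
the structure and the allowed_variants() helper are hypothetical, and the Q-Ids
are just examples:

    # Ordered variant codes per language Item; the first entry would be the
    # preferred/default one.
    LANGUAGE_VARIANTS = {
        "Q188": ["de", "de-CH", "de-AT"],    # German
        "Q1860": ["en", "en-GB", "en-US"],   # English
        "Q5146": ["pt", "pt-BR"],            # Portuguese
    }

    def allowed_variants(language_item_id: str) -> list[str]:
        # Look up the ordered variant list for a Lexeme's language Item.
        return LANGUAGE_VARIANTS.get(language_item_id, [])

    print(allowed_variants("Q188"))  # ['de', 'de-CH', 'de-AT']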

If we want to avoid this complexity, we could just go by prefix. So if the
language is "de", variants like "de-CH" or "de-DE_old" would be considered ok.
Ordering these alphabetically would put the "main" code (with no suffix) first.
May be ok for a start.
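
A rough sketch of that prefix rule, with hypothetical function names that do
not correspond to any existing Wikibase API:

    def is_accepted_variant(lexeme_language: str, variant_code: str) -> bool:
        # Accept the bare language code itself, or any code that extends it
        # with a suffix, e.g. "de-CH" or "de-DE_old" for "de".
        return variant_code == lexeme_language or \
            variant_code.startswith(lexeme_language + "-")

    def order_variants(codes: list[str]) -> list[str]:
        # Plain alphabetical ordering puts the bare "main" code first.
        return sorted(codes)

    print(is_accepted_variant("de", "de-CH"))            # True
    print(is_accepted_variant("de", "en-GB"))            # False
    print(order_variants(["de-DE_old", "de-CH", "de"]))  # ['de', 'de-CH', 'de-DE_old']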

I'm not sure yet at what level we want to enforce the restriction on language
codes. We can do it just before saving new data (the "validation" step), or we
could treat it as a community-enforced soft constraint. I'm tending towards the
former, though.

> Given that, we can in that very same place also define their ordering and
> their fallbacks.

Well, all lemmas would fall back on each other; the question is just which ones
should be preferred. Simple heuristic: prefer the shortest language code. Or go
by what MediaWiki does for the UI (which is what we do for Item labels).
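
Sketched out, the "shortest code wins" heuristic could look like this; the
function is illustrative only:

    def preferred_lemma(lemmas: dict[str, str]) -> tuple[str, str]:
        # Pick the lemma whose language code is shortest, ties broken
        # alphabetically; e.g. "de" beats "de-CH" and "de-DE_old".
        code = min(lemmas, key=lambda c: (len(c), c))
        return code, lemmas[code]

    print(preferred_lemma({"de-CH": "Foto", "de": "Foto"}))  # ('de', 'Foto')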

> The upside is that it seems that this very same solution could also be used
> for languages with different scripts, like Serbian, Kazakh, and Uzbek
> (although it would not cover the problems with Chinese, but that wasn't solved
> previously either - so the situation is strictly better). (It doesn't really
> solve all problems - there is a reason why ISO treats language variants and
> scripts independently - but it improves on the vast majority of the
> problematic cases).

Yes, it's not the only decision we have to make in this regard, but the most
fundamental one, I think.

One consequence of this is that Forms should probably also allow multiple
representations/spellings. This is for consistency with the lemma, for code
re-use, and for compatibility with Lemon.

> So, given that we drop any local ordering in the UI and API, I think that
> staying close to Lemon and choosing a TermList seems currently like the most
> promising approach to me, and I changed my mind. 

Knowing that you won't do that without a good reason, I thank you for the
compliment :)

> My previous reservations still
> hold, and it will lead to some more complexity in the implementation not only
> of Wikidata but also of tools built on top of it,

The complexity of handling a multi-variant lemma is higher than that of a single
string, but any Wikibase client already needs the relevant code anyway, to
handle Item labels. So I expect little overhead. We'll want the lemma to be
represented in a more compact way in the UI than we currently use for labels,
though.


Thank you all for your help!


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
