Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-12 Thread Thomas PT
This plan sounds great! Thank you!

A question about the tags used: instead of having "mis+Q7654321" internally
and "mis" externally, would it be possible to use a private use subtag [1]
like "mis-x-Q7654321" or "de-x-Q1980305" (or maybe "mis-x-wd-Q7654321" and
"de-x-wd-Q1980305") that would be used both internally and externally? It has
the advantage of being a valid BCP 47 tag and of allowing RDF users to extract
the exact language (and not only the much less informative "mis"). A variant
would be to use "x-Q7654321" instead of "mis-x-Q7654321" to avoid the "mis" tag
entirely.

Another possible way to go: just store the Qid internally and retrieve the
language tag from the item to build something like "de-x-wd-Q1980305" when
generating the output (or maybe just "de" if the consumer of the output does
not want custom extensions).
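
As a rough illustration of the private-use idea, the Python sketch below builds
such tags and strips them again. The function names and the way the optional
"wd" marker is passed are assumptions made for this example; they are not taken
from any Wikibase code. The only rule enforced is the BCP 47 limit of one to
eight alphanumeric characters per private-use subtag.

```python
import re

# BCP 47: every subtag after the "x" singleton is 1-8 alphanumeric characters.
_PRIVATE_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def private_use_tag(qid, base=None, marker=None):
    """Build a tag such as "de-x-wd-Q1980305" (hypothetical helper)."""
    private = [marker, qid] if marker else [qid]
    for subtag in private:
        if not _PRIVATE_SUBTAG.fullmatch(subtag):
            raise ValueError(f"not a valid private-use subtag: {subtag!r}")
    prefix = [base] if base else []
    return "-".join(prefix + ["x"] + private)


def strip_private_use(tag):
    """Return the part before "-x-", e.g. "de" for "de-x-wd-Q1980305"."""
    head, separator, _ = tag.partition("-x-")
    return head if separator else tag


print(private_use_tag("Q7654321", base="mis"))
print(private_use_tag("Q1980305", base="de", marker="wd"))
print(strip_private_use("de-x-wd-Q1980305"))
```

One caveat the sketch makes visible: an item ID with eight or more digits is
longer than eight characters, so it would not fit into a single private-use
subtag and would need some other encoding.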

Thomas

[1] https://tools.ietf.org/html/bcp47#section-2.2.7

> On 11 Apr 2017, at 03:01, Stas Malyshev wrote:
> 
> Hi!
> 
>> For instance, we not only need identifiers for German, Swiss and
>> Austrian German. We also need identifiers for German German before
>> and after the spelling reform of 1901, and before and after the
>> spelling reform of 1996. We will also
> 
> Theoretically, BCP 47 should be able to handle this? E.g. they have
> sl-IT-rozaj-biske-1994 as an example. But we probably shouldn't try to
> construct these tags ourselves, but instead let editors specify them.
> 
>> need identifiers for the "language" of mathematical notation. And for
>> various
> 
> That's where using zxx-math or something like that would be useful? Or
> we could omit any tags from those at all.

There is the Zmth script subtag for that.

>> The only system I know that gives us that flexibility is Wikidata.
>> For interoperability, we should provide a standard language code (aka
>> subtag). But a language code alone is not going to be sufficient to
>> distinguish the different variants we will need.
> 
> I think it can be, using BCP 47 extensions, but Wikidata team should not
> be taking care of it - instead, Wikidata editors should do it by
> assigning language tag properties to specific Wikidata items.



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-10 Thread Stas Malyshev
Hi!

> For instance, we not only need identifiers for German, Swiss and
> Austrian German. We also need identifiers for German German before
> and after the spelling reform of 1901, and before and after the
> spelling reform of 1996. We will also

Theoretically, BCP 47 should be able to handle this? E.g. they have
sl-IT-rozaj-biske-1994 as an example. But we probably shouldn't try to
construct these tags ourselves, but instead let editors specify them.
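
To make the structure of a tag like sl-IT-rozaj-biske-1994 concrete, here is a
small Python sketch that labels subtags by their shape alone. It is a
deliberate simplification: it ignores extlang, extension, private-use and
grandfathered tags, and it does not check anything against the IANA registry.
The function name is made up for this example.

```python
def classify_subtags(tag):
    """Split a simple language tag into language, script, region, variants."""
    language, *rest = tag.split("-")
    parsed = {"language": language.lower(), "script": None,
              "region": None, "variants": []}
    for sub in rest:
        # Script and region may only appear before the first variant.
        early = parsed["region"] is None and not parsed["variants"]
        if early and parsed["script"] is None and len(sub) == 4 and sub.isalpha():
            parsed["script"] = sub.title()
        elif early and ((len(sub) == 2 and sub.isalpha())
                        or (len(sub) == 3 and sub.isdigit())):
            parsed["region"] = sub.upper()
        elif sub.isalnum() and (5 <= len(sub) <= 8
                                or (len(sub) == 4 and sub[0].isdigit())):
            # Variants: 5-8 alphanumerics, or 4 characters starting with a digit.
            parsed["variants"].append(sub.lower())
        else:
            raise ValueError(f"unexpected subtag {sub!r} in {tag!r}")
    return parsed


print(classify_subtags("sl-IT-rozaj-biske-1994"))
print(classify_subtags("de-CH-1901"))
```

The second call uses the same shape for the German spelling-reform case: a
four-character subtag starting with a digit is parsed as a variant.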

> need identifiers for the "language" of mathematical notation. And for
> various

That's where using zxx-math or something like that would be useful? Or
we could omit any tags from those at all.

> The only system I know that gives us that flexibility is Wikidata.
> For interoperability, we should provide a standard language code (aka
> subtag). But a language code alone is not going to be sufficient to
> distinguish the different variants we will need.

I think it can be, using BCP 47 extensions, but Wikidata team should not
be taking care of it - instead, Wikidata editors should do it by
assigning language tag properties to specific Wikidata items.
-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-10 Thread Stas Malyshev
Hi!

> We will want to distinguish "a known language not on this list (mis)" from "an
> unknown language (und)" and "translingual" (Wiktionary uses "mul" for
> translingual, but that's not technically correct).

I think "mul" is for "text in more than one language", and there's also
"zxx", which is for "text that is not defined as being in any language at all".

BTW, BCP 47 also says:

The 'mis' (Uncoded) primary language subtag identifies content
whose language is known but that does not currently have a
corresponding subtag.  This subtag SHOULD NOT be used.
Because the addition of other codes in the future can render
its application invalid, it is inherently unstable and hence
incompatible with the stability goals of BCP 47.  It is always
preferable to use other subtags: either 'und' or (with prior
agreement) private use subtags.

So maybe using und would be a good idea.
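
The distinction between the four special codes can be summarised in a short
Python sketch. The function and parameter names are invented for this example,
and the rule "fall back to und unless mis is explicitly allowed" is only one
possible reading of the BCP 47 passage quoted above.

```python
SPECIAL_CODES = {
    "mis": "known language without its own subtag (uncoded)",
    "und": "undetermined language",
    "mul": "more than one language",
    "zxx": "no linguistic content",
}


def pick_code(registered=None, *, linguistic=True, multilingual=False,
              language_known=True, allow_mis=False):
    """Choose a primary language subtag for a piece of text (hypothetical)."""
    if registered:
        return registered      # a normal registered code always wins
    if not linguistic:
        return "zxx"
    if multilingual:
        return "mul"
    if language_known and allow_mis:
        return "mis"           # BCP 47 says this subtag SHOULD NOT be used
    return "und"


print(pick_code("de"))
print(pick_code())
print(pick_code(linguistic=False))
```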
-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-10 Thread Gerard Meijssen
Hoi,
The standard is flexible. It allows you to add user-defined parts. It
allows for languages that have no recognised language code. The point is
that the solution for external parties cannot be found in Wikidata itself.
We have to use the standards if we want interoperability. We need
interoperability and we need to define what it is that is expressed. Once
we decide that a specific expression of language is in use, we stick with
that definition. It can only be deprecated if that is what people want.
Thanks,
  GerardM

On 10 April 2017 at 19:10, Daniel Kinzler 
wrote:

> On 10.04.2017 at 18:56, Gerard Meijssen wrote:
> > Hoi,
> > The standard for the identification of a language should suffice.
>
> I know no standard that would be sufficient for our use case.
>
> For instance, we not only need identifiers for German, Swiss and Austrian
> German. We also need identifiers for German German before and after the
> spelling reform of 1901, and before and after the spelling reform of 1996.
> We will also need identifiers for the "language" of mathematical notation.
> And for various variants of ancient languages: not just Sumerian, but
> Sumerian from different regions and periods.
>
> The only system I know that gives us that flexibility is Wikidata. For
> interoperability, we should provide a standard language code (aka subtag).
> But a language code alone is not going to be sufficient to distinguish the
> different variants we will need.
>


Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-10 Thread Daniel Kinzler
On 10.04.2017 at 18:12, Denny Vrandečić wrote:
> So assume we enter a new Lexeme in Examplarian (which has a Q-Item), but
> Examplarian has no language code for whatever reason. What language code would
> they enter in the MultilingualTextValue?

My plan is: it will be "mis+Q7654321" internally, which will be exposed in HTML
and RDF as "mis".

We will want to distinguish "a known language not on this list (mis)" from "an
unknown language (und)" and "translingual" (Wiktionary uses "mul" for
translingual, but that's not technically correct).

-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.



Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-10 Thread Daniel Kinzler
On 10.04.2017 at 19:24, Denny Vrandečić wrote:
> Daniel, I agree, but isn't that what Multilingual Text requires? A language 
> code?

Yes. Well, internally, it just has to be *some* unique code. But for
interoperability, we want it to be a standard code. So I propose to internally
use something like "de+Q1980305", and expose that as "de" externally. This
allows us to distinguish however many variants of German we want internally, and
tag them all as "de" in HTML and RDF, so standard tools can use the language
information.
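
A minimal Python sketch of this internal/external split might look as follows.
The "+" separator comes from the proposal above; the type and function names
are assumptions made for the example and do not describe actual Wikibase code.

```python
from typing import NamedTuple, Optional


class InternalCode(NamedTuple):
    language: str          # standard code, e.g. "de"
    item: Optional[str]    # variant item, e.g. "Q1980305", or None


def parse_internal(code):
    """Split "de+Q1980305" into its standard code and item ID."""
    language, _, item = code.partition("+")
    if not language:
        raise ValueError(f"missing language code in {code!r}")
    if item and not (item[0] == "Q" and item[1:].isdigit()):
        raise ValueError(f"not an item ID: {item!r}")
    return InternalCode(language, item or None)


def external_code(code):
    """Code to expose in HTML and RDF: the suffix is simply dropped."""
    return parse_internal(code).language


print(external_code("de+Q1980305"))
print(external_code("mis+Q7654321"))
```

All internal variants of German collapse to "de" on output, which is the
interoperability property described above.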

> I assume most of it is hidden behind mini-wizards like "Create a new lexeme",
> which actually make sure the multitext language and the language property are
> consistently set. In that case I can see this work.

Yes, that is exactly the plan for the NewLexeme page.

We'll still have to come up with a nifty UI for "add a lemma, select a language,
and optionally an item identifying a variant of that language".

-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.



Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-10 Thread Denny Vrandečić
Daniel, I agree, but isn't that what Multilingual Text requires? A language
code?

I.e. how does the current model plan to solve that?

I assume most of it is hidden behind mini-wizards like "Create a new
lexeme", which actually make sure the multitext language and the language
property are consistently set. In that case I can see this work.



On Mon, Apr 10, 2017 at 10:11 AM Daniel Kinzler 
wrote:

> On 10.04.2017 at 18:56, Gerard Meijssen wrote:
> > Hoi,
> > The standard for the identification of a language should suffice.
>
> I know no standard that would be sufficient for our use case.
>
> For instance, we not only need identifiers for German, Swiss and Austrian
> German. We also need identifiers for German German before and after the
> spelling reform of 1901, and before and after the spelling reform of 1996.
> We will also need identifiers for the "language" of mathematical notation.
> And for various variants of ancient languages: not just Sumerian, but
> Sumerian from different regions and periods.
>
> The only system I know that gives us that flexibility is Wikidata. For
> interoperability, we should provide a standard language code (aka subtag).
> But a language code alone is not going to be sufficient to distinguish the
> different variants we will need.
>


Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-10 Thread Daniel Kinzler
On 10.04.2017 at 18:56, Gerard Meijssen wrote:
> Hoi,
> The standard for the identification of a language should suffice.

I know no standard that would be sufficient for our use case.

For instance, we not only need identifiers for German, Swiss and Austrian
German. We also need identifiers for German German before and after the spelling
reform of 1901, and before and after the spelling reform of 1996. We will also
need identifiers for the "language" of mathematical notation. And for various
variants of ancient languages: not just Sumerian, but Sumerian from different
regions and periods.

The only system I know that gives us that flexibility is Wikidata. For
interoperability, we should provide a standard language code (aka subtag). But a
language code alone is not going to be sufficient to distinguish the different
variants we will need.

-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.



Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-10 Thread Gerard Meijssen
Hoi,
The standard for the identification of a language should suffice. As long
as we follow the standard and insist on the identification in this manner
it is always possible to provide an identification. When you insist on an
item ID, that item ID needs to have a language code and this language code
must never change.

Without this there is no interoperability.
Thanks,
  GerardM

On 10 April 2017 at 17:42, Daniel Kinzler 
wrote:

> Tobias' comment made me realize that I did not clarify one very important
> distinction: there are two kinds of places where a "language" is needed in
> the Lexeme data model:
>
> 1) the "lexeme language". This can be any Item, with or without a language
> code. This is what Tobias would have to use in his query.
>
> 2) the language codes used in the MultilingualTextValues (lemma,
> representation, and gloss). This is where my "hybrid" approach comes in:
> use a standard language code augmented by an item ID to identify the
> variant.
>
> To make it easy to create new Lexemes, the lexeme language can serve as a
> default for lemma, representation, and gloss - but only if it has a
> language code. If it does not have one, the user will have to specify one
> for use in MultilingualTextValues.
>
>
> On 06.04.2017 at 19:59, Tobias Schönberg wrote:
> > An example using the second suggestion:
> >
> > If I would like to query all L-items that contain a combination of
> > letters and limit those results by getting the Q-items of the language
> > and limit those to those that have Latin influences.
> >
> > In my imagination this would work better using the second suggestion.
> > Also the flexibility of "what is a language" and "what is a dialect"
> > would seem easier if we can attach statements to the UserLanguageCode or
> > the Q-item of the language.
> >
> > -Tobias
>
>


Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-10 Thread Denny Vrandečić
So assume we enter a new Lexeme in Examplarian (which has a Q-Item), but
Examplarian has no language code for whatever reason. What language code
would they enter in the MultilingualTextValue?


On Mon, Apr 10, 2017 at 8:42 AM Daniel Kinzler 
wrote:

> Tobias' comment made me realize that I did not clarify one very important
> distinction: there are two kinds of places where a "language" is needed in
> the Lexeme data model:
>
> 1) the "lexeme language". This can be any Item, with or without a language
> code. This is what Tobias would have to use in his query.
>
> 2) the language codes used in the MultilingualTextValues (lemma,
> representation, and gloss). This is where my "hybrid" approach comes in:
> use a standard language code augmented by an item ID to identify the
> variant.
>
> To make it easy to create new Lexemes, the lexeme language can serve as a
> default for lemma, representation, and gloss - but only if it has a
> language code. If it does not have one, the user will have to specify one
> for use in MultilingualTextValues.
>
>
> On 06.04.2017 at 19:59, Tobias Schönberg wrote:
> > An example using the second suggestion:
> >
> > If I would like to query all L-items that contain a combination of
> > letters and limit those results by getting the Q-items of the language
> > and limit those to those that have Latin influences.
> >
> > In my imagination this would work better using the second suggestion.
> > Also the flexibility of "what is a language" and "what is a dialect"
> > would seem easier if we can attach statements to the UserLanguageCode or
> > the Q-item of the language.
> >
> > -Tobias
>
>


Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-10 Thread Daniel Kinzler
Tobias' comment made me realize that I did not clarify one very important
distinction: there are two kinds of places where a "language" is needed in the
Lexeme data model:

1) the "lexeme language". This can be any Item, with or without a language
code. This is what Tobias would have to use in his query.

2) the language codes used in the MultilingualTextValues (lemma, representation,
and gloss). This is where my "hybrid" approach comes in: use a standard language
code augmented by an item ID to identify the variant.

To make it easy to create new Lexemes, the lexeme language can serve as a
default for lemma, representation, and gloss - but only if it has a language
code. If it does not have one, the user will have to specify one for use in
MultilingualTextValues.
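
The defaulting rule described above can be sketched in Python roughly as
follows. Everything here is illustrative: the class, the function names and the
lookup table are assumptions, the item IDs are only the example IDs used
elsewhere in this thread, and the "code+item" form is borrowed from the hybrid
proposal rather than from any finished specification.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical lookup: language item -> standard language code, if it has one.
ITEM_CODES = {"Q1980305": "de"}   # "Q7654321" deliberately has no code


@dataclass
class Lexeme:
    language_item: str                                      # 1) lexeme language
    lemmas: Dict[str, str] = field(default_factory=dict)    # 2) code -> text


def default_text_code(language_item: str) -> Optional[str]:
    """Hybrid code derived from the lexeme language, or None if it has no code."""
    code = ITEM_CODES.get(language_item)
    return f"{code}+{language_item}" if code else None


def add_lemma(lexeme: Lexeme, text: str, user_code: Optional[str] = None) -> str:
    """Store a lemma under the user's code, else under the default code."""
    code = user_code or default_text_code(lexeme.language_item)
    if code is None:
        raise ValueError("lexeme language has no code; one must be specified")
    lexeme.lemmas[code] = text
    return code


german = Lexeme("Q1980305")
print(add_lemma(german, "Haus"))

examplarian = Lexeme("Q7654321")
print(add_lemma(examplarian, "foo", user_code="mis+Q7654321"))
```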


On 06.04.2017 at 19:59, Tobias Schönberg wrote:
> An example using the second suggestion:
>
> If I would like to query all L-items that contain a combination of letters and
> limit those results by getting the Q-items of the language and limit those to
> those that have Latin influences.
>
> In my imagination this would work better using the second suggestion. Also the
> flexibility of "what is a language" and "what is a dialect" would seem easier
> if we can attach statements to the UserLanguageCode or the Q-item of the
> language.
> 
> -Tobias


-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.



Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-07 Thread Info WorldUniversity
Denny,

Yes, yet the timing is good for these great developments you're making with
languages in Wikidata4Wiktionary.

Cheers,
Scott



On Fri, Apr 7, 2017 at 11:50 AM, Denny Vrandečić 
wrote:

> Scott,
>
> I assume you realized that the article by Norvig you cited was rather
> intentionally published on April 1st.
>
> Cheers,
> Denny
>
> On Fri, Apr 7, 2017 at 11:04 AM Scott MacLeod wrote:
>
>> I tried to see how the ISO codes and IANA language subtags compare with
>> Glottolog's 8,444 entries under languages
>> (http://glottolog.org/glottolog/language) and Ethnologue's 7,099 living
>> languages (https://www.ethnologue.com/), but couldn't find any
>> comparisons or comparative lists.
>>
>> Will it be possible with these new developments in Wikidata to query for
>> these possibilities, and leave the options open for a growing list of
>> languages, as well as a universal translator?
>>
>> And how will invented languages be added, such as Krell, Elvish and
>> Klingon (and even other species' languages in emergent interspecies'
>> communications), and possibly per OpenNMT (Neural Machine Translation) -
>> http://opennmt.net/ (and possibly GNMT); see also Peter Norvig's recent
>> article in regard to OpenNMT and invented languages -
>> https://medium.com/@peternorvig/last-tweets-of-the-krell-82b8cb74c320 (and
>> per
>> http://scott-macleod.blogspot.com/2017/04/falco-peregrinus-smartphone-that-could.html).
>>
>> Scott
>>
>>
>>
>> On Fri, Apr 7, 2017 at 10:13 AM, Daniel Kinzler <
>> daniel.kinz...@wikimedia.de> wrote:
>>
>> On 07.04.2017 at 01:34, Denny Vrandečić wrote:
>> > I foresee that might be a bit of a problem for external tools consuming
>> > this data - how would they figure out what language it is if it
>> > doesn't have a code? We could of course generate fake codes like
>> > mis-x-q12345, maybe that would work.
>> >
>> > Q-items for languages already have a property to state their language
>> > code. It's just an extra hop away.
>>
>> We want ISO codes (or rather, IANA language subtags [1]), so we can use
>> them in HTML lang attributes, and in RDF literals. This allows
>> interoperability with standard tools.
>>
>> For this reason, I also favor a mixed approach, that allows standard
>> language tags to be used whenever possible. I have some ideas on how that
>> could work, but no definite plan yet.
>>
>> Something like de+Q1980305 could work; when generating HTML or RDF, we'd
>> just drop the suffix. For translingual entries (e.g. for the number symbol
>> i), we could use e.g. mis+Q1140046.
>>
>>
>> [1]
>> https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
>>



Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-07 Thread Denny Vrandečić
Scott,

I assume you realized that the article by Norvig you cited was rather
intentionally published on April 1st.

Cheers,
Denny

On Fri, Apr 7, 2017 at 11:04 AM Scott MacLeod <
worlduniversityandsch...@gmail.com> wrote:

> I tried to see how the ISO codes and IANA language subtags compare with
> Glottolog's 8,444 entries under languages (
> http://glottolog.org/glottolog/language) and Ethnologue's 7,099 living
> languages (https://www.ethnologue.com/), but couldn't find any
> comparisons or comparative lists.
>
> Will it be possible with these new developments in Wikidata to query for
> these possibilities, and leave the options open for a growing list of
> languages, as well as a universal translator?
>
> And how will invented languages be added, such as Krell, Elvish and
> Klingon (and even other species' languages in emergent interspecies'
> communications), and possibly per OpenNMT (Neural Machine Translation) -
> http://opennmt.net/ (and possibly GNMT); see also Peter Norvig's recent
> article in regard to OpenNMT and invented languages -
> https://medium.com/@peternorvig/last-tweets-of-the-krell-82b8cb74c320 (and
> per
> http://scott-macleod.blogspot.com/2017/04/falco-peregrinus-smartphone-that-could.html).
>
> Scott
>
>
>
> On Fri, Apr 7, 2017 at 10:13 AM, Daniel Kinzler <
> daniel.kinz...@wikimedia.de> wrote:
>
> On 07.04.2017 at 01:34, Denny Vrandečić wrote:
> > I foresee that might be a bit of a problem for external tools consuming
> > this data - how would they figure out what language it is if it
> > doesn't have a code? We could of course generate fake codes like
> > mis-x-q12345, maybe that would work.
> >
> > Q-items for languages already have a property to state their language
> > code. It's just an extra hop away.
>
> We want ISO codes (or rather, IANA language subtags [1]), so we can use
> them in HTML lang attributes, and in RDF literals. This allows
> interoperability with standard tools.
>
> For this reason, I also favor a mixed approach, that allows standard
> language tags to be used whenever possible. I have some ideas on how that
> could work, but no definite plan yet.
>
> Something like de+Q1980305 could work; when generating HTML or RDF, we'd
> just drop the suffix. For translingual entries (e.g. for the number symbol
> i), we could use e.g. mis+Q1140046.
>
>
> [1]
> https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
>


Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-07 Thread Scott MacLeod
I tried to see how the ISO codes and IANA language subtags compare with
Glottolog's 8,444 entries under languages (
http://glottolog.org/glottolog/language) and Ethnologue's 7,099 living
languages (https://www.ethnologue.com/), but couldn't find any comparisons
or comparative lists.

Will it be possible with these new developments in Wikidata to query for
these possibilities, and leave the options open for a growing list of
languages, as well as a universal translator?

And how will invented languages be added, such as Krell, Elvish and Klingon
(and even other species' languages in emergent interspecies'
communications), and possibly per OpenNMT (Neural Machine Translation) -
http://opennmt.net/ (and possibly GNMT); see also Peter Norvig's recent
article in regard to OpenNMT and invented languages -
https://medium.com/@peternorvig/last-tweets-of-the-krell-82b8cb74c320 (and
per
http://scott-macleod.blogspot.com/2017/04/falco-peregrinus-smartphone-that-could.html).

Scott



On Fri, Apr 7, 2017 at 10:13 AM, Daniel Kinzler  wrote:

> On 07.04.2017 at 01:34, Denny Vrandečić wrote:
> > I foresee that might be a bit of a problem for external tools consuming
> > this data - how would they figure out what language it is if it
> > doesn't have a code? We could of course generate fake codes like
> > mis-x-q12345, maybe that would work.
> >
> > Q-items for languages already have a property to state their language
> > code. It's just an extra hop away.
>
> We want ISO codes (or rather, IANA language subtags [1]), so we can use
> them in HTML lang attributes, and in RDF literals. This allows
> interoperability with standard tools.
>
> For this reason, I also favor a mixed approach, that allows standard
> language tags to be used whenever possible. I have some ideas on how that
> could work, but no definite plan yet.
>
> Something like de+Q1980305 could work; when generating HTML or RDF, we'd
> just drop the suffix. For translingual entries (e.g. for the number symbol
> i), we could use e.g. mis+Q1140046.
>
>
> [1]
> https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
>



-- 

-- 
- Scott MacLeod - Founder & President
- World University and School
- http://worlduniversityandschool.org

- 415 480 4577
- http://scottmacleod.com


- CC World University and School - like CC Wikipedia with best STEM-centric
CC OpenCourseWare - incorporated as a nonprofit university and school in
California, and is a U.S. 501 (c) (3) tax-exempt educational organization.




Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-07 Thread Stas Malyshev
Hi!

> Something like de+Q1980305 could work; when generating HTML or RDF, we'd just
> drop the suffix. For translingual entries (e.g. for the number symbol i), we
> could use e.g. mis+Q1140046.

I think for those that are not in any particular language, und or zxx could
be better. mis, as I read it, is for "this is in a specific language, but we
don't have a code for it".
See https://en.wikipedia.org/wiki/ISO_639

-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-07 Thread David Cuenca Tudela
Personally I would prefer a mixed approach, where there is a list of
top-level items that are authorized, and then verifying that the item used
is a subclass of any of those items. Whether those constraints are
hard-enforced or just supervised could be a topic of discussion, but IMHO
the more automated, the better.

Regarding the codes, they can be generated from the code of the top-level
item plus the Q number of the item used. If someone wants only one part or
the other, the unwanted part should be quite easy to strip.
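
A rough sketch of how that could look, assuming the "+" separator proposed elsewhere in this thread. The subclass edges and the list of authorized top-level items below are made-up illustration data, and the function names are hypothetical.

```python
# Authorized top-level items and their codes (illustrative data only).
TOP_LEVEL_CODES = {"Q188": "de"}

# Assumed "subclass of" edges, for illustration only.
SUBCLASS_OF = {"Q1980305": ["Q188"]}


def find_top_level(qid):
    """Return an authorized top-level item reachable from qid, or None."""
    seen = set()
    stack = [qid]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles in the subclass graph
        seen.add(current)
        if current in TOP_LEVEL_CODES:
            return current
        stack.extend(SUBCLASS_OF.get(current, []))
    return None


def make_code(qid):
    """Build '<top-level code>+<Q-id>', rejecting unauthorized items."""
    top = find_top_level(qid)
    if top is None:
        raise ValueError(f"{qid} is not under any authorized top-level item")
    return f"{TOP_LEVEL_CODES[top]}+{qid}"


print(make_code("Q1980305"))
```

Whether a failed check raises an error (hard enforcement) or only gets logged for review (supervision) is the policy question mentioned above; the sketch picks hard enforcement only to keep it short.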

Cheers,
David

On Thu, Apr 6, 2017 at 6:51 PM, Denny Vrandečić  wrote:

> The current spec of the data model states that an L-Item has a lemma, a
> language, and several forms, and the forms in turn have representations.
>
> https://www.mediawiki.org/wiki/Extension:WikibaseLexeme/Data_Model
>
> The language is a Q-Item, the lemma and the representations are
> Multilingual Texts. Multilingual texts are sets of pairs of strings and
> UserLanguageCodes.
>
> My question is about the relation between representing a language as a
> Q-Item and as a UserLanguageCode.
>
> A previous proposal treated lemmas and representations as raw strings,
> with the language pointing to the Q-Item being the only language
> information. This now is gone, and the lemma and representation carry their
> own language information.
>
> How do they interact? The language set referencable through Q-Items is
> much larger than the set of languages with a UserLanguageCode, and indeed,
> the intention was to allow for every language to be representable in
> Wikidata, not only those with a UserLanguageCode.
>
> I sense quite a problem here.
>
> I see two possible ways to resolve this:
> - return to the original model and use strings instead of Multilingual
> texts (with all the negative implications for variants)
> - use Q-Items instead of UserLanguageCodes for Multilingual texts (which
> would be quite a migration)
>
> I don't think restricting Wiktionary4Wikidata support to the list of
> languages with a UserLanguageCode is a viable solution, which would happen
> if we implement the data model as currently suggested, if I understand it
> correctly.
>
> Cheers,
> Denny


-- 
Etiamsi omnes, ego non


Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-06 Thread Gerard Meijssen
Hoi,
There are many valid ways to describe something that is not a language, and
the language used may be one that does not have a language code. There is a
standard for indicating languages; it allows for something like
"US-American Spanish" by combining a country code and a language code. This
is well known.

The problem with everything that has not been recognised / standardised /
defined as a language is that it is highly political. The practical side is
that we can use an x in a code to indicate a special use. However, then
calling it a language is problematic because a language ought to mean that
its understanding is mutually exclusive.

Calling it a language code and using "expressed in" would, IMHO, work for any
form of language. When the Wiktionaries' content is imported into Wikidata,
we first have to have these language codes agreed on. Importing the bulk
first is no problem. It puts pressure on the resolution of such issues and
that is not half bad.
Thanks,
  GerardM

On 7 April 2017 at 01:34, Denny Vrandečić  wrote:

>
>
> On Thu, Apr 6, 2017, 16:16 Stas Malyshev  wrote:
>
>> Hi!
>>
>> > - use Q-Items instead of UserLanguageCodes for Multilingual texts (which
>> > would be quite a migration)
>>
>> I foresee that might be a bit of a problem for external tools consuming
>> this data - how would they figure out what language it is if it
>> doesn't have a code? We could of course generate fake codes like
>> mis-x-q12345, maybe that would work.
>>
>
> Q-items for languages already have a property to state their language
> code. It's just an extra hop away.
>
>
>
>> > I don't think restricting Wiktionary4Wikidata support to the list of
>> > languages with a UserLanguageCode is a viable solution, which would
>> > happen if we implement the data model as currently suggested, if I
>> > understand it correctly.
>>
>> Aren't we limiting it right now this way in Wikidata?
>>
>
> For labels and descriptions of items yes, and I think that was sensible.
> It might be time to revisit that decision though.
>
> But for supporting Wiktionary that would be extremely limiting. French
> Wiktionary supports words in more than a thousand languages currently.
> Limiting the supported languages of the lemmas is, IMHO, unacceptable.
>
>
>> --
>> Stas Malyshev
>> smalys...@wikimedia.org


Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-06 Thread Denny Vrandečić
On Thu, Apr 6, 2017, 16:16 Stas Malyshev  wrote:

> Hi!
>
> > - use Q-Items instead of UserLanguageCodes for Multilingual texts (which
> > would be quite a migration)
>
> I foresee that might be a bit of a problem for external tools consuming
> this data - how would they figure out what language it is if it
> doesn't have a code? We could of course generate fake codes like
> mis-x-q12345, maybe that would work.
>

Q-items for languages already have a property to state their language code.
It's just an extra hop away.
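
As a sketch of that extra hop: the dictionary below stands in for item data that a real consumer would fetch from Wikidata, and the property ids P218 and P220 are assumed here to be the ISO 639-1 and ISO 639-3 code properties. The "Q_EXAMPLE" entry and the function name are hypothetical.

```python
# Illustrative stand-in for language items and their code statements.
LANGUAGE_ITEMS = {
    "Q188": {"P218": "de", "P220": "deu"},
    "Q_EXAMPLE": {},  # hypothetical language item with no code statement
}

# Properties to try, in order of preference (assumed ids, see above).
CODE_PROPERTIES = ["P218", "P220"]


def language_code_for(qid, fallback="mis"):
    """Follow the hop from a language item to its code, or use a fallback."""
    claims = LANGUAGE_ITEMS.get(qid, {})
    for prop in CODE_PROPERTIES:
        if prop in claims:
            return claims[prop]
    return fallback


print(language_code_for("Q188"))
print(language_code_for("Q_EXAMPLE"))
```

The fallback matters for exactly the languages under discussion: those that have an item but no standard code.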



> > I don't think restricting Wiktionary4Wikidata support to the list of
> > languages with a UserLanguageCode is a viable solution, which would
> > happen if we implement the data model as currently suggested, if I
> > understand it correctly.
>
> Aren't we limiting it right now this way in Wikidata?
>

For labels and descriptions of items yes, and I think that was sensible. It
might be time to revisit that decision though.

But for supporting Wiktionary that would be extremely limiting. French
Wiktionary supports words in more than a thousand languages currently.
Limiting the supported languages of the lemmas is, IMHO, unacceptable.


> --
> Stas Malyshev
> smalys...@wikimedia.org


Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-06 Thread Stas Malyshev
Hi!

> - use Q-Items instead of UserLanguageCodes for Multilingual texts (which
> would be quite a migration)

I foresee that might be a bit of a problem for external tools consuming
this data - how would they figure out what language it is if it
doesn't have a code? We could of course generate fake codes like
mis-x-q12345, maybe that would work.
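
A minimal sketch of generating such codes, assuming the "<base>-x-<qid>" shape above. The function name is made up. One detail worth noting is that BCP 47 limits each subtag to 8 alphanumeric characters, so this simple scheme only covers Q-ids of up to seven digits.

```python
import re

# A BCP 47 subtag is 1 to 8 ASCII letters or digits.
SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def private_use_tag(qid, base="mis"):
    """Build a private-use tag such as 'mis-x-q12345' from a Q-id."""
    subtag = qid.lower()
    if SUBTAG.fullmatch(subtag) is None:
        raise ValueError(f"{qid!r} does not fit in a single 8-character subtag")
    return f"{base}-x-{subtag}"


print(private_use_tag("Q12345"))
```

Longer Q-ids would need a different encoding, for example splitting the digits over several subtags; that is left open here.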

> I don't think restricting Wiktionary4Wikidata support to the list of
> languages with a UserLanguageCode is a viable solution, which would
> happen if we implement the data model as currently suggested, if I
> understand it correctly.

Aren't we limiting it right now this way in Wikidata?

-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] Languages in Wikidata4Wiktionary

2017-04-06 Thread Tobias Schönberg
An example using the second suggestion:

Suppose I would like to query all L-items that contain a given combination
of letters, then narrow those results by fetching the Q-items of their
languages and keeping only the languages that have Latin influences.

As I imagine it, this would work better with the second suggestion. The
question of "what is a language" and "what is a dialect" would also seem
easier to handle flexibly if we can attach statements to the
UserLanguageCode or the Q-item of the language.
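
Roughly, in code, and purely as an illustration: the data, the ids and the "influenced_by_latin" flag below are all made up, and a real query would run against the query service rather than in-memory dictionaries.

```python
# Hypothetical L-items, each pointing to the Q-item of its language.
LEXEMES = [
    {"id": "L1", "lemma": "aqua", "language": "Q_LANG_A"},
    {"id": "L2", "lemma": "aquatic", "language": "Q_LANG_B"},
    {"id": "L3", "lemma": "aqueduct", "language": "Q_LANG_C"},
    {"id": "L4", "lemma": "water", "language": "Q_LANG_B"},
]

# Hypothetical statements attached to the language Q-items.
LANGUAGES = {
    "Q_LANG_A": {"influenced_by_latin": True},
    "Q_LANG_B": {"influenced_by_latin": True},
    "Q_LANG_C": {"influenced_by_latin": False},
}


def find_lexemes(letters):
    """Return ids of L-items whose lemma contains the given letters and
    whose language item is marked as having Latin influences."""
    result = []
    for lexeme in LEXEMES:
        if letters not in lexeme["lemma"]:
            continue
        language = LANGUAGES.get(lexeme["language"], {})
        if language.get("influenced_by_latin", False):
            result.append(lexeme["id"])
    return result


print(find_lexemes("aqu"))
```

The second filter step is only possible because the language is an item that can carry statements; with a bare code string there would be nothing to filter on.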

-Tobias

2017-04-06 18:51 GMT+02:00 Denny Vrandečić :

> The current spec of the data model states that an L-Item has a lemma, a
> language, and several forms, and the forms in turn have representations.
>
> https://www.mediawiki.org/wiki/Extension:WikibaseLexeme/Data_Model
>
> The language is a Q-Item, the lemma and the representations are
> Multilingual Texts. Multilingual texts are sets of pairs of strings and
> UserLanguageCodes.
>
> My question is about the relation between representing a language as a
> Q-Item and as a UserLanguageCode.
>
> A previous proposal treated lemmas and representations as raw strings,
> with the language pointing to the Q-Item being the only language
> information. This now is gone, and the lemma and representation carry their
> own language information.
>
> How do they interact? The language set referencable through Q-Items is
> much larger than the set of languages with a UserLanguageCode, and indeed,
> the intention was to allow for every language to be representable in
> Wikidata, not only those with a UserLanguageCode.
>
> I sense quite a problem here.
>
> I see two possible ways to resolve this:
> - return to the original model and use strings instead of Multilingual
> texts (with all the negative implications for variants)
> - use Q-Items instead of UserLanguageCodes for Multilingual texts (which
> would be quite a migration)
>
> I don't think restricting Wiktionary4Wikidata support to the list of
> languages with a UserLanguageCode is a viable solution, which would happen
> if we implement the data model as currently suggested, if I understand it
> correctly.
>
> Cheers,
> Denny