[Wikidata-bugs] [Maniphest] T269587: Low hanging fruits for the WMDE Data Quality WD/WB Team

2020-12-14 Thread Lydia_Pintscher
Lydia_Pintscher closed this task as "Resolved".
Lydia_Pintscher added a comment.


  No, let's close it. Thanks a lot :)

TASK DETAIL
  https://phabricator.wikimedia.org/T269587

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: GoranSMilovanovic, Lydia_Pintscher
Cc: Silvan_WMDE, Lydia_Pintscher, GoranSMilovanovic, Aklapper, Akuckartz, 
Nandana, Lahi, Gq86, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, 
Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] T269587: Low hanging fruits for the WMDE Data Quality WD/WB Team

2020-12-14 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Lydia_Pintscher Can we resolve this ticket or do we need anything else here?



[Wikidata-bugs] [Maniphest] T269587: Low hanging fruits for the WMDE Data Quality WD/WB Team

2020-12-12 Thread Lydia_Pintscher
Lydia_Pintscher added a comment.


  In T269587#6685887, @GoranSMilovanovic wrote:
  
  > @Lydia_Pintscher Here it goes:
  >
  > F33943238: propertyLanguages_20201211.csv 

  
  Thanks! Will analyze.



[Wikidata-bugs] [Maniphest] T269587: Low hanging fruits for the WMDE Data Quality WD/WB Team

2020-12-11 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Lydia_Pintscher Here it goes:
  
  F33943238: propertyLanguages_20201211.csv 




[Wikidata-bugs] [Maniphest] T269587: Low hanging fruits for the WMDE Data Quality WD/WB Team

2020-12-11 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Lydia_Pintscher Of course, it will be produced and posted here during the 
day.



[Wikidata-bugs] [Maniphest] T269587: Low hanging fruits for the WMDE Data Quality WD/WB Team

2020-12-11 Thread Lydia_Pintscher
Lydia_Pintscher added a comment.


  In T269587#6680787, @GoranSMilovanovic wrote:
  
  > @Lydia_Pintscher
  >
  >> Check coverage of labels, descriptions, aliases on Properties
  >
  > Please see the `csv` file attached. Fields:
  >
  > - `property`
  > - `labels` - how many labels
  > - `aliases` - in how many different languages do we find aliases for this property
  > - `descriptions` - how many descriptions
  > - `percentLanguages_Labels` - percentage of languages covered by the labels of the respective property
  > - `percentLanguages_Descriptions` - percentage of languages covered by the descriptions of the respective property
  > - `percentLanguages_Aliases` - percentage of languages covered by the aliases of the respective property
  >
  > **Note.** I find (SPARQL, WDQS) that we currently have **581** languages with the Wikimedia Language Code present. That number, 581, was used as a denominator to calculate the percentages that are reported in the table.
  >
  > F33941268: propertyLanguages.csv 

  
  Thank you!
  
  Could we get a csv of the following?
  
  - Property ID
  - English Label
  - data type of the Property
  - number of labels
  - number of descriptions
  - number of aliases
  - number of Items using the Property



[Wikidata-bugs] [Maniphest] T269587: Low hanging fruits for the WMDE Data Quality WD/WB Team

2020-12-10 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  With respect to T269587#6680787: we need to change the anchor (languages with a Wikimedia Language Code).



[Wikidata-bugs] [Maniphest] T269587: Low hanging fruits for the WMDE Data Quality WD/WB Team

2020-12-09 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Lydia_Pintscher
  
  > Check coverage of labels, descriptions, aliases on Properties
  
  Please see the `csv` file attached. Fields:
  
  - `property`
  - `labels` - how many labels
  - `aliases` - in how many different languages do we find aliases for this property
  - `descriptions` - how many descriptions
  - `percentLanguages_Labels` - percentage of languages covered by the labels of the respective property
  - `percentLanguages_Descriptions` - percentage of languages covered by the descriptions of the respective property
  - `percentLanguages_Aliases` - percentage of languages covered by the aliases of the respective property
  
  **Note.** I find (SPARQL, WDQS) that we currently have **581** languages with the Wikimedia Language Code present. That number, 581, was used as a denominator to calculate the percentages that are reported in the table.
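
  As a rough illustration of how the percentage columns relate to the 581 denominator, here is a minimal sketch. The function name `percent_languages` and the example count of 120 languages are hypothetical and not taken from the attached file.

```python
# Denominator quoted in the note above: languages with a
# Wikimedia Language Code found via SPARQL/WDQS.
N_LANGUAGES = 581


def percent_languages(n_covered: int, n_languages: int = N_LANGUAGES) -> float:
    """Share of languages (in percent) covered by a property's terms."""
    return n_covered / n_languages * 100


# Hypothetical property with labels in 120 different languages.
example_percent = percent_languages(120)
print(f"{example_percent:.2f}% of languages covered")
```

  The same function would apply to labels, descriptions, and aliases alike, since all three columns share one denominator.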
  
  F33941268: propertyLanguages.csv 



[Wikidata-bugs] [Maniphest] T269587: Low hanging fruits for the WMDE Data Quality WD/WB Team

2020-12-09 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Lydia_Pintscher Ok, the data reported in T269587#6679451 seem to be fine.
  
  The list of all "hanging items" (items with no `P31`, `P279`, or `P361` value) found in the `2020-11-23` snapshot of the hdfs copy of the WD JSON dump is, of course, too large to be shared here. A `zip` archive will be shared with you via email or Google Drive.
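
  For readers who want to reproduce the "hanging item" test on a few entities, the check can be sketched in plain Python as below. This is not the Pyspark code used for the task. It assumes entities follow the standard Wikidata JSON layout, with a `claims` object keyed by property ID, and it treats any statement on `P31`, `P279`, or `P361` as a "value". The sample entity IDs and the `P18` key are illustrative placeholders.

```python
# Properties that classify an item: instance of, subclass of, part of.
CLASSIFYING_PROPERTIES = ("P31", "P279", "P361")


def is_hanging(entity: dict) -> bool:
    """True if the entity has no statement on any classifying property."""
    claims = entity.get("claims", {})
    return not any(claims.get(pid) for pid in CLASSIFYING_PROPERTIES)


# Hypothetical, heavily trimmed entities for illustration only.
sample = [
    {"id": "Q_A", "claims": {"P31": [{"mainsnak": {}}]}},
    {"id": "Q_B", "claims": {"P18": [{"mainsnak": {}}]}},
    {"id": "Q_C", "claims": {}},
]
hanging_ids = [entity["id"] for entity in sample if is_hanging(entity)]
print(hanging_ids)
```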



[Wikidata-bugs] [Maniphest] T269587: Low hanging fruits for the WMDE Data Quality WD/WB Team

2020-12-09 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Lydia_Pintscher The data reported in T269587#6679451 will have to undergo revision: I have spotted a glitch in my filtering procedures in Pyspark.



[Wikidata-bugs] [Maniphest] T269587: Low hanging fruits for the WMDE Data Quality WD/WB Team

2020-12-09 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Lydia_Pintscher
  
  > properties that can be used to check completeness (e.g. "number of children" + "number of participants")
  > find a list of such "structural" properties
  
  Well, what I did was the following:
  
  - used Apache Spark to parse the most recent JSON dump;
  - singled out all properties whose English label matches the simple regex `number_of`;
  - attached the resulting list, with columns:
    - **id**
    - **language**
    - **label**
    - **usable**: `0` means "not usable" for the purpose of this task, `1` means usable, `1/2` means "maybe, but..."
    - **comment**: typically accompanies the `1/2` value on **usable**.
  
  If you take a look at my notes in the **comment** column, you will see that few properties of this form can be used in line with the idea in our example: *properties that can be used to check completeness (e.g. "number of children" + "number of participants")*. We have a property for the number of people arrested in an event, for example, but are all such people represented by QIDs anywhere in Wikidata, so that we could check for completeness? I don't think so.
  
  Your thoughts and comments, please. Thank you.
  
  F33940896: WD_StructuralProperties.csv 




[Wikidata-bugs] [Maniphest] T269587: Low hanging fruits for the WMDE Data Quality WD/WB Team

2020-12-09 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Lydia_Pintscher
  
  > How many entities do we have that are not classified via `instance of P31`, `subclass of P279`, and `part of P361`?
  
  According to the most recent hdfs copy of the Wikidata JSON dump (snapshot: `2020-11-23`):
  
  - there are `90,880,584` items in Wikidata, while
  - there are `87,905,748` items that have a `P31`, `P279`, or `P361` value;
  - thus, there are `90,880,584` - `87,905,748` = `2,974,836` "hanging items" that have no `P31`, `P279`, or `P361` value, which is
  - `2,974,836`/`90,880,584`*`100` = `3.273346%` of Wikidata.
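
  The arithmetic above can be re-checked with a few lines of Python. The two counts are the ones quoted in this comment, and the variable names are only illustrative.

```python
# Counts quoted in the comment (snapshot 2020-11-23).
total_items = 90_880_584
classified_items = 87_905_748  # items with a P31, P279, or P361 value

# Items without any of the three classifying properties.
hanging_items = total_items - classified_items
hanging_share = hanging_items / total_items * 100

print(f"{hanging_items:,} hanging items = {hanging_share:.6f}% of all items")
```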



[Wikidata-bugs] [Maniphest] T269587: Low hanging fruits for the WMDE Data Quality WD/WB Team

2020-12-09 Thread GoranSMilovanovic
GoranSMilovanovic updated the task description.



[Wikidata-bugs] [Maniphest] T269587: Low hanging fruits for the WMDE Data Quality WD/WB Team

2020-12-07 Thread GoranSMilovanovic
GoranSMilovanovic created this task.
GoranSMilovanovic added projects: User-GoranSMilovanovic, 
WMDE-Analytics-Engineering, Wikidata.
Restricted Application added a subscriber: Aklapper.

TASK DESCRIPTION
  - Produce all "immediately" available indicators derived from the discussion in the WMDE Data Quality WD/WB Team;
  - Re-use all Pyspark code used to parse the hdfs copy of the WD JSON dump;
  - Mind the scaling!
