Re: [Wikidata] Concise/Notable Wikidata Dump

2020-01-07 Thread Simon Razniewski
Hi,

Just wanted to express my belated support for such dumps:
 - We encounter the same problem in research, and for efficiency,
reproducibility, and authoritativeness alike, a centralized solution would be
great.
 - Besides filtering for existence in Wikipedia, I'd see much potential
in removing labels. In most of our use cases labels are not needed in the
computations, and where introspection is needed, one can selectively add
them post hoc. Alternatively, only retaining English labels would also save
a lot (and I don't see concerns of cultural bias, as long as we only use
them as decoration, not inside computations; a rough sketch of such a filter
follows below).
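
To make the label-stripping idea concrete, here is a minimal sketch over a
gzipped N-Triples truthy dump; the set of label-like predicates and the file
name are assumptions, so adjust them to the dump you actually use.

import gzip
import re
import sys

# Label-like predicates in the Wikidata RDF dumps (rdfs:label, skos:prefLabel,
# skos:altLabel, schema:name, schema:description); assumed, adjust as needed.
LABEL_PREDICATES = {
    "<http://www.w3.org/2000/01/rdf-schema#label>",
    "<http://www.w3.org/2004/02/skos/core#prefLabel>",
    "<http://www.w3.org/2004/02/skos/core#altLabel>",
    "<http://schema.org/name>",
    "<http://schema.org/description>",
}

# Matches the language tag at the end of an N-Triples literal, e.g. "Berlin"@de .
LANG_TAG = re.compile(r'"@([A-Za-z-]+) \.\s*$')

def keep(line: str) -> bool:
    """Keep non-label triples, and label triples only when tagged @en."""
    try:
        _, predicate, rest = line.split(" ", 2)
    except ValueError:
        return True  # blank or malformed line: pass it through unchanged
    if predicate not in LABEL_PREDICATES:
        return True
    match = LANG_TAG.search(rest)
    return bool(match) and match.group(1).lower() == "en"

if __name__ == "__main__":
    # Usage: python filter_labels.py latest-truthy.nt.gz > truthy-en-labels.nt
    with gzip.open(sys.argv[1], "rt", encoding="utf-8") as infile:
        for line in infile:
            if keep(line):
                sys.stdout.write(line)

Dropping the label predicates entirely (rather than keeping only @en) is the
same filter with the @en check removed.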

Thanks also for pointing out the WDumper tool; it looks great. Maybe it
would be worth highlighting selected dumps prominently on its start page?
(The names in "recent dumps" alone are not always informative, so one has
to inspect the specs one by one, and arguably, the large number of dumps also
means some loss of authoritativeness.)

Cheers,
Simon


> Hey all,
>
> As someone who likes to use Wikidata in their research, and likes to
> give students projects relating to Wikidata, I am finding it more and
> more difficult to (recommend to) work with recent versions of Wikidata
> due to the increasing dump sizes, where even the truthy version now
> costs considerable time and machine resources to process and handle. In
> some cases we just grin and bear the costs, while in other cases we
> apply an ad hoc sampling to be able to play around with the data and try
> things quickly.
>
> More generally, I think the growing data volumes might inadvertently
> scare people off taking the dumps and using them in their research.
>
> One idea we had recently to reduce the data size for a student project
> while keeping the most notable parts of Wikidata was to only keep claims
> that involve an item linked to Wikipedia; in other words, if the
> statement involves a Q item (in the "subject" or "object") not linked to
> Wikipedia, the statement is removed.
>
> I wonder would it be possible for Wikidata to provide such a dump to
> download (e.g., in RDF) for people who prefer to work with a more
> concise sub-graph that still maintains the most "notable" parts? While
> of course one could compute this from the full-dump locally, making such
> a version available as a dump directly would save clients some
> resources, potentially encourage more research using/on Wikidata, and
> having such a version "rubber-stamped" by Wikidata would also help to
> justify the use of such a dataset for research purposes.
>
> ... just an idea I thought I would float out there. Perhaps there is
> another (better) way to define a concise dump.
>
> Best,
> Aidan
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Concise/Notable Wikidata Dump

2019-12-22 Thread Amirouche Boubekki
Hello all!

On Tue, 17 Dec 2019 at 18:15, Aidan Hogan wrote:
>
> Hey all,
>
> As someone who likes to use Wikidata in their research, and likes to
> give students projects relating to Wikidata, I am finding it more and
> more difficult to (recommend to) work with recent versions of Wikidata
> due to the increasing dump sizes, where even the truthy version now
> costs considerable time and machine resources to process and handle.

Maybe that is a software problem? What tools do you use to process the dump?

> More generally, I think the growing data volumes might inadvertently
> scare people off taking the dumps and using them in their research.
>
> One idea we had recently to reduce the data size for a student project
> while keeping the most notable parts of Wikidata was to only keep claims
> that involve an item linked to Wikipedia; in other words, if the
> statement involves a Q item (in the "subject" or "object") not linked to
> Wikipedia, the statement is removed.

A similar scheme would be to only keep concepts that are part of the Wikipedia
vital articles [0] and their neighbours (to be defined).

[0] https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/5

Related to Wikipedia vital articles (of which I only know the English version):
the problem is that the vital article lists are not available in a structured
format. A few months back I made a proposal to add that information to
Wikidata, but I got no feedback. There is https://www.wikidata.org/wiki/Q43375360.
Not sure where to go from there.
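
To illustrate the "neighbours" part: a rough sketch that, given a
hand-collected seed list of QIDs (resolving the vital-article titles to QIDs
would have to be done separately, since the lists are not structured), asks
the Wikidata Query Service for everything one truthy (wdt:) hop away. The
one-hop definition of "neighbour" and the tiny seed set are just placeholders;
the full list would need batched queries.

import requests

WDQS = "https://query.wikidata.org/sparql"

def one_hop_neighbours(seed_qids):
    """Return the QIDs reachable in one truthy (wdt:) hop from the seed items."""
    values = " ".join(f"wd:{q}" for q in seed_qids)
    query = f"""
    PREFIX wd: <http://www.wikidata.org/entity/>
    SELECT DISTINCT ?neighbour WHERE {{
      VALUES ?seed {{ {values} }}
      ?seed ?p ?neighbour .
      FILTER(STRSTARTS(STR(?p), "http://www.wikidata.org/prop/direct/"))
      FILTER(STRSTARTS(STR(?neighbour), "http://www.wikidata.org/entity/Q"))
    }}
    """
    response = requests.get(
        WDQS,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "concise-dump-sketch/0.1 (example)"},
    )
    response.raise_for_status()
    return {binding["neighbour"]["value"].rsplit("/", 1)[-1]
            for binding in response.json()["results"]["bindings"]}

# Example: expand two seeds (Q42 = Douglas Adams, Q2 = Earth) by one hop.
print(sorted(one_hop_neighbours(["Q42", "Q2"]))[:10])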

> I wonder would it be possible for Wikidata to provide such a dump to
> download (e.g., in RDF) for people who prefer to work with a more
> concise sub-graph that still maintains the most "notable" parts?

The best thing would be to allow people to create their own lists of vital
Wikidata concepts, similar to how there are custom Wikipedia vital lists,
taking inspiration from the tool that was released recently.

> While
> of course one could compute this from the full-dump locally, making such
> a version available as a dump directly would save clients some
> resources, potentially encourage more research using/on Wikidata, and
> having such a version "rubber-stamped" by Wikidata would also help to
> justify the use of such a dataset for research purposes.

I agree.

> ... just an idea I thought I would float out there. Perhaps there is
> another (better) way to define a concise dump.
>
> Best,
> Aidan
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata



-- 
Amirouche ~ https://hyper.dev

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Concise/Notable Wikidata Dump Wikidata Digest, Vol 97, Issue 13

2019-12-21 Thread PWN
Hello all,
Regarding the limiting of dumps, I fear it nullifies one of the huge advantages
of Wikidata, which is to expand structured, referenced data beyond the often
too narrow confines of Wikipedia. Women and marginalized communities who are
frequently eliminated for lack of "notability" by overzealous or misguided
Wikipedia editors risk being accidentally re-eliminated by confining dumps to
items with wikilinks. (Remember the female researcher whose Wikipedia page was
rejected for "lack of notability" - just before she won a Nobel Prize?)

I think Wikidata dumps should be complete, with a possibility of
user-controlled selection by topic or period or other query, but not by what
amounts to a kind of "hidden" filter of approval by a Wikipedia editor
somewhere outside of Wikidata in a widely disseminated dump marked,
misleadingly, as "notable".

Selection is very powerful in the digital world, where people assume (wrongly)
that what they see is what exists.



Sent from my iPad

> On Dec 20, 2019, at 13:00, wikidata-requ...@lists.wikimedia.org wrote:
> 
> Send Wikidata mailing list submissions to
>wikidata@lists.wikimedia.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
>https://lists.wikimedia.org/mailman/listinfo/wikidata
> or, via email, send a message with subject or body 'help' to
>wikidata-requ...@lists.wikimedia.org
> 
> You can reach the person managing the list at
>wikidata-ow...@lists.wikimedia.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Wikidata digest..."
> 
> 
> Today's Topics:
> 
>   1. Re: Concise/Notable Wikidata Dump (Aidan Hogan)
> 
> 
> --
> 
> Message: 1
> Date: Thu, 19 Dec 2019 19:15:09 -0300
> From: Aidan Hogan 
> To: wikidata@lists.wikimedia.org
> Subject: Re: [Wikidata] Concise/Notable Wikidata Dump
> Message-ID: 
> Content-Type: text/plain; charset=utf-8; format=flowed
> 
> Hey all,
> 
> Just a general response to all the comments thus far.
> 
> - @Marco et al., regarding the WDumper by Benno, this is a very cool 
> initiative! In fact I spotted it just *after* posting so I think this 
> goes quite some ways towards addressing the general issue raised.
> 
> - @Markus, I partially disagree regarding the importance of 
> rubber-stamping a "notable dump" on the Wikidata side. I would see its 
> value as being something like the "truthy dump", which I believe has 
> been widely used in research for working with a concise sub-set of 
> Wikidata. Perhaps a middle ground is for a sporadic "notable dump" to be 
> generated by WDumper and published on Zenodo. This may be sufficient in 
> terms of making the dump available and reusable for research purposes 
> (or even better than the current dumps, given the permanence you 
> mention). Also it would reduce costs on the Wikidata side (I don't think 
> a notable dump would be necessary to generate on a weekly basis, for 
> example).
> 
> - @Lydia, good point! I was thinking that filtering by wikilinks will 
> just drop some more obscure nodes (like Q51366847 for example), but had 
> not considered that there are some more general "concepts" that do not 
> have a corresponding Wikipedia article. All the same, in a lot of the 
> research we use Wikidata for, we are not particularly interested in one 
> thing or another, but more interested in facilitating what other people 
> are interested in. Examples would be query performance, finding paths, 
> versioning, finding references, etc. But point taken! Maybe there is a 
> way to identify "general entities" that do not have wikilinks, but do 
> have a high degree or centrality, for example? Would a degree-based or 
> centrality-based filter be possible in something like WDumper (perhaps 
> it goes beyond the original purpose; certainly it does not seem trivial 
> in terms of resources used)? Would it be a good idea?
> 
> In summary, I like the idea of using WDumper to sporadically generate -- 
> and publish on Zenodo -- a "notable version" of Wikidata filtered by 
> sitelinks (perhaps also allowing other high-degree or high-PageRank 
> nodes to pass the filter). At least I know I would use such a dump.
> 
> Best,
> Aidan
> 
>> On 2019-12-19 6:46, Lydia Pintscher wrote:
>>> On Tue, Dec 17, 2019 at 7:16 PM Aidan Hogan  wrote:
>>> 
>>> Hey all,
>>> 
>>> As someone who likes to use Wikidata in their research, and likes to
>>> give students projects relating to Wikidata, I am finding it more and
>>> more difficult to (recommend to) work with recent versions of Wikidata
> > [...]

Re: [Wikidata] Concise/Notable Wikidata Dump

2019-12-21 Thread Sebastian Hellmann

Hi Aidan,

Since DBpedia has been around for twelve years now, we have spent the last
three years intensively re-engineering it to solve problems like this.


Last week, we finished the Virtuoso DBpedia Docker [1] to work on
Databus Collections [2],[3]. Databus contains different repartitions of
all the datasets, i.e. Wikipedia/Wikidata extractions and external data.
The idea here is that datasets or graphs are stored in a granular manner
and you then make your own collection (a DCAT Catalog) or reuse
collections made by others.


This goes in the direction of building 1 billion derived knowledge
graphs by 2025:
https://databus.dbpedia.org/dbpedia/publication/strategy/2019.09.09/strategy_databus_initiative.pdf


We analysed a lot of problems in the GlobalFactSync project [5] and
studied Wikidata intensively. Our conclusion is that we will turn
DBpedia around by 180 degrees in the future. So instead of taking the
main data from Wikipedia and Wikidata, we will take it from the sources
directly, since all data in Wikipedia and Wikidata comes from somewhere
else. So the new direction is LOD -> DBpedia -> Wikipedia/Wikidata via
sameAs and equivalentClass/Property mappings.


This is not a solution for the dump size problem per se, because we are
creating even bigger, more varied, and more domain-specific knowledge
graphs and dumps via FlexiFusion [4]. Besides the flexible source
partitions, we offer a partition by property, where you can simply pick
the properties you want for your knowledge graph and then docker-load
them into the SPARQL store of your choice. There is no manual yet, but
this query gives you all 3.8 million birthdates of the new big fused
graph: https://databus.dbpedia.org/yasgui/


PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dataid-cv: <http://dataid.dbpedia.org/ns/cv#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>

# Note: the angle-bracketed IRIs were stripped from the archived message; the
# prefix IRIs, the version IRI, and the content-variant predicate below are
# best-guess reconstructions, so double-check them against the Databus docs.
SELECT DISTINCT ?file WHERE {
    ?dataset dataid:version
        <https://databus.dbpedia.org/vehnem/flexifusion/fusion/2019.11.15> .
    ?dataset dcat:distribution ?distribution .
    ?distribution dataid-cv:tag
        'birthDate'^^<http://www.w3.org/2001/XMLSchema#string> .
    ?distribution dcat:downloadURL ?file .
}

You can filter more here:
https://databus.dbpedia.org/vehnem/flexifusion/fusion/2019.11.15
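
For what it's worth, a sketch of how such a partition could be consumed end to
end: run the query above and download every ?file it returns, ready to be
loaded into a SPARQL store (e.g. via the Docker setup in [1]). The endpoint
URL behind the YASGUI page is an assumption here.

import pathlib
import requests

# Assumption: the public SPARQL endpoint behind the YASGUI page linked above.
DATABUS_SPARQL = "https://databus.dbpedia.org/repo/sparql"

# The birth-date query from above (with the reconstructed IRIs).
QUERY = """
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dataid-cv: <http://dataid.dbpedia.org/ns/cv#>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT DISTINCT ?file WHERE {
  ?dataset dataid:version <https://databus.dbpedia.org/vehnem/flexifusion/fusion/2019.11.15> .
  ?dataset dcat:distribution ?distribution .
  ?distribution dataid-cv:tag 'birthDate'^^<http://www.w3.org/2001/XMLSchema#string> .
  ?distribution dcat:downloadURL ?file .
}
"""

def download_partition(out_dir: str = "birthdates") -> int:
    """Run the query and download every ?file it returns into out_dir."""
    response = requests.get(
        DATABUS_SPARQL,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
    )
    response.raise_for_status()
    urls = [b["file"]["value"] for b in response.json()["results"]["bindings"]]
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    for url in urls:
        with requests.get(url, stream=True) as download:
            download.raise_for_status()
            with open(out / url.rsplit("/", 1)[-1], "wb") as fh:
                for chunk in download.iter_content(chunk_size=1 << 20):
                    fh.write(chunk)
    return len(urls)

if __name__ == "__main__":
    print(f"downloaded {download_partition()} files")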


Over the next year, we will include all European library data in the
syncing process, as well as several national statistical datasets and
other data, and refine the way to extract exactly the partition you need.
It is an opportunistic extension to Linked Open Data, where you can select
the partition you need independently of the IDs or vocabulary used.


-- Sebastian


[1] https://github.com/dbpedia/Dockerized-DBpedia

[2] https://forum.dbpedia.org/t/dbpedia-dataset-2019-08-30-pre-release/219

[3] https://github.com/dbpedia/minimal-download-client

[4] https://svn.aksw.org/papers/2019/ISWC_FlexiFusion/public.pdf

[5] https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE

On 19.12.19 23:15, Aidan Hogan wrote:

Hey all,

Just a general response to all the comments thus far.

- @Marco et al., regarding the WDumper by Benno, this is a very cool 
initiative! In fact I spotted it just *after* posting so I think this 
goes quite some ways towards addressing the general issue raised.


- @Markus, I partially disagree regarding the importance of 
rubber-stamping a "notable dump" on the Wikidata side. I would see 
its value as being something like the "truthy dump", which I believe 
has been widely used in research for working with a concise sub-set of 
Wikidata. Perhaps a middle ground is for a sporadic "notable dump" to 
be generated by WDumper and published on Zenodo. This may be 
sufficient in terms of making the dump available and reusable for 
research purposes (or even better than the current dumps, given the 
permanence you mention). Also it would reduce costs on the Wikidata 
side (I don't think a notable dump would be necessary to generate on a 
weekly basis, for example).


- @Lydia, good point! I was thinking that filtering by wikilinks will 
just drop some more obscure nodes (like Q51366847 for example), but 
had not considered that there are some more general "concepts" that do 
not have a corresponding Wikipedia article. All the same, in a lot of 
the research we use Wikidata for, we are not particularly interested 
in one thing or another, but more interested in facilitating what 
other people are interested in. Examples would be query performance, 
finding paths, versioning, finding references, etc. But point taken! 
Maybe there is a way to identify "general entities" that do not have 
wikilinks, but do have a high degree or centrality, for example? Would 
a degree-based or centrality-based filter be possible in something 
like WDumper (perhaps it goes beyond the original purpose; certainly 
it does not seem trivial in terms of resources used)? Would it be a 
good idea?


In summary, I like the 

Re: [Wikidata] Concise/Notable Wikidata Dump

2019-12-21 Thread Lydia Pintscher
On Sat, Dec 21, 2019 at 6:37 PM Dan Brickley  wrote:
> That is also a fine place to record things! I don’t mean to fork the 
> discussion. Maybe we could have a call for interested parties in the new year?

Yeah that sounds like a good idea.


Cheers
Lydia

-- 
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Concise/Notable Wikidata Dump

2019-12-21 Thread Dan Brickley
On Sat, 21 Dec 2019 at 17:25, Lydia Pintscher 
wrote:

> On Thu, Dec 19, 2019 at 11:16 PM Aidan Hogan  wrote:
> > - @Lydia, good point! I was thinking that filtering by wikilinks will
> > just drop some more obscure nodes (like Q51366847 for example), but had
> > not considered that there are some more general "concepts" that do not
> > have a corresponding Wikipedia article. All the same, in a lot of the
> > research we use Wikidata for, we are not particularly interested in one
> > thing or another, but more interested in facilitating what other people
> > are interested in. Examples would be query performance, finding paths,
> > versioning, finding references, etc. But point taken! Maybe there is a
> > way to identify "general entities" that do not have wikilinks, but do
> > have a high degree or centrality, for example? Would a degree-based or
> > centrality-based filter be possible in something like WDumper (perhaps
> > it goes beyond the original purpose; certainly it does not seem trivial
> > in terms of resources used)? Would it be a good idea?
>
> I think it's definitely worth exploring but I fear it needs someone to
> actually sit down and collect the different dump use cases and talk
> to people to figure out which part of the data they need. Based on
> that we could identify common patterns.


Yeah, there are a bunch of quite varied motivations for subsets. I have
found the topic of Wikidata subsetting and data dumps coming up again and
again, most recently in a life-science/bioinformatics setting, which is how
we ended up collecting raw materials in the doc already shared here,
https://docs.google.com/document/d/1MmrpEQ9O7xA6frNk6gceu_IbQrUiEYGI9vcQjDvTL9c
but also in other domains. If people here care to drop use cases, thoughts
and notes (*however scrappy*) into that doc, I will make a pass over it to
try to pull together a more readable summary of the various motivations for
subsetting.

The work Adam wrote up at
https://addshore.com/2019/10/your-own-wikidata-query-service-with-no-limits-part-1/
is also very relevant...

(I think this is something
> that needs to be done but unfortunately I can't dedicate time to it in
> the foreseeable future. https://phabricator.wikimedia.org/T46581 is a
> good place for people who want to help think it through.)


That is also a fine place to record things! I don’t mean to fork the
discussion. Maybe we could have a call for interested parties in the new
year?

Dan



>
>
> Cheers
> Lydia
>
> --
> Lydia Pintscher - http://about.me/lydia.pintscher
> Product Manager for Wikidata
>
> Wikimedia Deutschland e.V.
> Tempelhofer Ufer 23-24
> 10963 Berlin
> www.wikimedia.de
>
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
>
> Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
> unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
> Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Concise/Notable Wikidata Dump

2019-12-21 Thread Lydia Pintscher
On Thu, Dec 19, 2019 at 11:16 PM Aidan Hogan  wrote:
> - @Lydia, good point! I was thinking that filtering by wikilinks will
> just drop some more obscure nodes (like Q51366847 for example), but had
> not considered that there are some more general "concepts" that do not
> have a corresponding Wikipedia article. All the same, in a lot of the
> research we use Wikidata for, we are not particularly interested in one
> thing or another, but more interested in facilitating what other people
> are interested in. Examples would be query performance, finding paths,
> versioning, finding references, etc. But point taken! Maybe there is a
> way to identify "general entities" that do not have wikilinks, but do
> have a high degree or centrality, for example? Would a degree-based or
> centrality-based filter be possible in something like WDumper (perhaps
> it goes beyond the original purpose; certainly it does not seem trivial
> in terms of resources used)? Would it be a good idea?

I think it's definitely worth exploring but I fear it needs someone to
actually sit down and collect the different dump use cases and talk
to people to figure out which part of the data they need. Based on
that we could identify common patterns. (I think this is something
that needs to be done but unfortunately I can't dedicate time to it in
the foreseeable future. https://phabricator.wikimedia.org/T46581 is a
good place for people who want to help think it through.)


Cheers
Lydia

-- 
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Concise/Notable Wikidata Dump

2019-12-19 Thread Aidan Hogan

Hey all,

Just a general response to all the comments thus far.

- @Marco et al., regarding the WDumper by Benno, this is a very cool 
initiative! In fact I spotted it just *after* posting so I think this 
goes quite some ways towards addressing the general issue raised.


- @Markus, I partially disagree regarding the importance of 
rubber-stamping a "notable dump" on the Wikidata side. I would see its 
value as being something like the "truthy dump", which I believe has 
been widely used in research for working with a concise sub-set of 
Wikidata. Perhaps a middle ground is for a sporadic "notable dump" to be 
generated by WDumper and published on Zenodo. This may be sufficient in 
terms of making the dump available and reusable for research purposes 
(or even better than the current dumps, given the permanence you 
mention). Also it would reduce costs on the Wikidata side (I don't think 
a notable dump would be necessary to generate on a weekly basis, for 
example).


- @Lydia, good point! I was thinking that filtering by wikilinks will 
just drop some more obscure nodes (like Q51366847 for example), but had 
not considered that there are some more general "concepts" that do not 
have a corresponding Wikipedia article. All the same, in a lot of the 
research we use Wikidata for, we are not particularly interested in one 
thing or another, but more interested in facilitating what other people 
are interested in. Examples would be query performance, finding paths, 
versioning, finding references, etc. But point taken! Maybe there is a 
way to identify "general entities" that do not have wikilinks, but do 
have a high degree or centrality, for example? Would a degree-based or 
centrality-based filter be possible in something like WDumper (perhaps 
it goes beyond the original purpose; certainly it does not seem trivial 
in terms of resources used)? Would it be a good idea?


In summary, I like the idea of using WDumper to sporadically generate -- 
and publish on Zenodo -- a "notable version" of Wikidata filtered by 
sitelinks (perhaps also allowing other high-degree or high-PageRank 
nodes to pass the filter). At least I know I would use such a dump.
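
To make that concrete, here is a rough two-pass sketch (not how WDumper works
internally) over a gzipped N-Triples dump that contains the schema:about
sitelink triples: the first pass collects sitelinked Q-items and a crude
degree count (how many triples mention each item), the second pass keeps only
statements in which every mentioned Q-item survives the filter. The degree
threshold and the use of raw triple counts as a stand-in for degree or
PageRank are arbitrary assumptions.

import gzip
import re
import sys
from collections import Counter

SCHEMA_ABOUT = "<http://schema.org/about>"
QID = re.compile(r"<http://www\.wikidata\.org/entity/(Q\d+)>")

def collect_keep_set(dump_path, degree_threshold=1000):
    """Pass 1: QIDs that have a sitelink (schema:about) or a high triple count."""
    sitelinked, degree = set(), Counter()
    with gzip.open(dump_path, "rt", encoding="utf-8") as dump:
        for line in dump:
            qids = QID.findall(line)
            if SCHEMA_ABOUT in line and qids:
                sitelinked.add(qids[-1])  # <article> schema:about <entity>
            degree.update(qids)
    high_degree = {q for q, d in degree.items() if d >= degree_threshold}
    return sitelinked | high_degree

def write_filtered(dump_path, keep, out):
    """Pass 2: keep a triple only if every Q-item it mentions is in the keep set."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as dump:
        for line in dump:
            if all(q in keep for q in QID.findall(line)):
                out.write(line)

if __name__ == "__main__":
    # Usage: python notable_filter.py wikidata-dump.nt.gz > notable.nt
    keep = collect_keep_set(sys.argv[1])
    write_filtered(sys.argv[1], keep, sys.stdout)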


Best,
Aidan

On 2019-12-19 6:46, Lydia Pintscher wrote:

On Tue, Dec 17, 2019 at 7:16 PM Aidan Hogan  wrote:


Hey all,

As someone who likes to use Wikidata in their research, and likes to
give students projects relating to Wikidata, I am finding it more and
more difficult to (recommend to) work with recent versions of Wikidata
due to the increasing dump sizes, where even the truthy version now
costs considerable time and machine resources to process and handle. In
some cases we just grin and bear the costs, while in other cases we
apply an ad hoc sampling to be able to play around with the data and try
things quickly.

More generally, I think the growing data volumes might inadvertently
scare people off taking the dumps and using them in their research.

One idea we had recently to reduce the data size for a student project
while keeping the most notable parts of Wikidata was to only keep claims
that involve an item linked to Wikipedia; in other words, if the
statement involves a Q item (in the "subject" or "object") not linked to
Wikipedia, the statement is removed.

I wonder would it be possible for Wikidata to provide such a dump to
download (e.g., in RDF) for people who prefer to work with a more
concise sub-graph that still maintains the most "notable" parts? While
of course one could compute this from the full-dump locally, making such
a version available as a dump directly would save clients some
resources, potentially encourage more research using/on Wikidata, and
having such a version "rubber-stamped" by Wikidata would also help to
justify the use of such a dataset for research purposes.

... just an idea I thought I would float out there. Perhaps there is
another (better) way to define a concise dump.

Best,
Aidan


Hi Aidan,

That the dumps are becoming too big is an issue I've heard a number of
times now. It's something we need to tackle. My biggest issue, though, is
deciding how to slice and dice it in a way that works for many use cases.
We have https://phabricator.wikimedia.org/T46581 to brainstorm about that
and figure it out. Input from several people is very welcome. I also added
a link to Benno's tool there.
As for the specific suggestion: I fear relying on the existence of
sitelinks will kick out a lot of important things you would care about,
like professions, so I'm not sure that's a good thing to offer officially
to a larger audience.


Cheers
Lydia



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Concise/Notable Wikidata Dump

2019-12-19 Thread Lydia Pintscher
On Tue, Dec 17, 2019 at 7:16 PM Aidan Hogan  wrote:
>
> Hey all,
>
> As someone who likes to use Wikidata in their research, and likes to
> give students projects relating to Wikidata, I am finding it more and
> more difficult to (recommend to) work with recent versions of Wikidata
> due to the increasing dump sizes, where even the truthy version now
> costs considerable time and machine resources to process and handle. In
> some cases we just grin and bear the costs, while in other cases we
> apply an ad hoc sampling to be able to play around with the data and try
> things quickly.
>
> More generally, I think the growing data volumes might inadvertently
> scare people off taking the dumps and using them in their research.
>
> One idea we had recently to reduce the data size for a student project
> while keeping the most notable parts of Wikidata was to only keep claims
> that involve an item linked to Wikipedia; in other words, if the
> statement involves a Q item (in the "subject" or "object") not linked to
> Wikipedia, the statement is removed.
>
> I wonder would it be possible for Wikidata to provide such a dump to
> download (e.g., in RDF) for people who prefer to work with a more
> concise sub-graph that still maintains the most "notable" parts? While
> of course one could compute this from the full-dump locally, making such
> a version available as a dump directly would save clients some
> resources, potentially encourage more research using/on Wikidata, and
> having such a version "rubber-stamped" by Wikidata would also help to
> justify the use of such a dataset for research purposes.
>
> ... just an idea I thought I would float out there. Perhaps there is
> another (better) way to define a concise dump.
>
> Best,
> Aidan

Hi Aidan,

That the dumps are becoming too big is an issue I've heard a number of
times now. It's something we need to tackle. My biggest issue, though, is
deciding how to slice and dice it in a way that works for many use cases.
We have https://phabricator.wikimedia.org/T46581 to brainstorm about that
and figure it out. Input from several people is very welcome. I also added
a link to Benno's tool there.
As for the specific suggestion: I fear relying on the existence of
sitelinks will kick out a lot of important things you would care about,
like professions, so I'm not sure that's a good thing to offer officially
to a larger audience.


Cheers
Lydia

-- 
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Concise/Notable Wikidata Dump

2019-12-19 Thread Markus Kroetzsch

Hi all,

Yes, Benno's WDumper could be used for this purpose. The motivation for
the whole project was very similar to what Aidan describes. We realised,
though, that there won't be a single good way to build a smaller dump that
would serve every conceivable use in research, which is why the UI lets
users make custom dumps.


In general, we are happy to hear more ideas on how to build useful
smaller dumps. We are also accepting pull requests.


Benno, could you add a feature to include only items with a Wikipedia
page (in some language, or in any language)?


Edgard, I don't think making this more "official" will be very important 
for most researchers. Benno spent quite some time on aligning the RDF 
export with the official dumps, so in practice, WDumper mostly produces 
a subset of the triples of the official dump (which one could also have 
extracted manually). If there are differences left between the formats, 
we will be happy to hear about them (a GitHub issue would be the best
way to report them).


As Benno already wrote, WDumper connects to Zenodo to ensure that 
exported datasets are archived in a permanent and citable fashion. This 
is very important for research. As far as I know, none of the existing 
dumps (official or not) guarantee long-term availability at the moment.


Cheers,

Markus



On 18/12/2019 13:37, Edgard Marx wrote:
It certainly helps; however, I think Aidan's suggestion goes in the
direction of having an official dump distribution.


Imagine how much CO2 can be spared just by avoiding the computational
resources needed to recreate this dump every time someone needs it.


Besides, it standardises the dataset used for research purposes.

On Wed, Dec 18, 2019, 11:26 Marco Fossati wrote:


Hi everyone,

Benno (in CC) has recently announced this tool:
https://tools.wmflabs.org/wdumps/

I haven't checked it out yet, but it sounds related to Aidan's inquiry.
Hope this helps.

Cheers,

Marco

On 12/18/19 8:01 AM, Edgard Marx wrote:
 > +1
 >
 > On Tue, Dec 17, 2019, 19:14 Aidan Hogan <aid...@gmail.com> wrote:
 >
 >     Hey all,
 >
 >     As someone who likes to use Wikidata in their research, and likes to
 >     give students projects relating to Wikidata, I am finding it more and
 >     more difficult to (recommend to) work with recent versions of Wikidata
 >     due to the increasing dump sizes, where even the truthy version now
 >     costs considerable time and machine resources to process and handle. In
 >     some cases we just grin and bear the costs, while in other cases we
 >     apply an ad hoc sampling to be able to play around with the data and try
 >     things quickly.
 >
 >     More generally, I think the growing data volumes might inadvertently
 >     scare people off taking the dumps and using them in their research.
 >
 >     One idea we had recently to reduce the data size for a student project
 >     while keeping the most notable parts of Wikidata was to only keep claims
 >     that involve an item linked to Wikipedia; in other words, if the
 >     statement involves a Q item (in the "subject" or "object") not linked to
 >     Wikipedia, the statement is removed.
 >
 >     I wonder would it be possible for Wikidata to provide such a dump to
 >     download (e.g., in RDF) for people who prefer to work with a more
 >     concise sub-graph that still maintains the most "notable" parts? While
 >     of course one could compute this from the full-dump locally, making such
 >     a version available as a dump directly would save clients some
 >     resources, potentially encourage more research using/on Wikidata, and
 >     having such a version "rubber-stamped" by Wikidata would also help to
 >     justify the use of such a dataset for research purposes.
 >
 >     ... just an idea I thought I would float out there. Perhaps there is
 >     another (better) way to define a concise dump.
 >
 >     Best,
 >     Aidan
 >
 >     ___
 >     Wikidata mailing list
 > Wikidata@lists.wikimedia.org
 > https://lists.wikimedia.org/mailman/listinfo/wikidata
 >
 >
 > ___
 > Wikidata mailing list
 > Wikidata@lists.wikimedia.org 
 > https://lists.wikimedia.org/mailman/listinfo/wikidata
 >

___
Wikidata mailing list

Re: [Wikidata] Concise/Notable Wikidata Dump

2019-12-18 Thread James Heald

See also this recent discussion/brainstorm on "Wikidata subsetting"

https://docs.google.com/document/d/1MmrpEQ9O7xA6frNk6gceu_IbQrUiEYGI9vcQjDvTL9c/edit#heading=h.7xg3cywpkgfq

In a geographical context, whether or not an item has a Wikipedia entry
has been contemplated as a criterion for filtering Wikidata for a
gazetteer, e.g. by the @gbhgis team behind the Vision of Britain site
(Humphrey Southall).


But I would caution against using the criterion blindly -- Wikidata 
notability includes "structural need" as an inclusion criterion for good 
reason.  You definitely wouldn't want items becoming disconnected from 
their P31/P279* subclass tree because one of the intermediate items had 
been omitted.
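
To make the "structural need" point concrete: whatever filter is used, one
could afterwards add back the P31/P279 closure of the kept items so the class
tree stays connected. A minimal sketch, assuming the instance-of/subclass-of
edges have already been extracted from the dump into a dictionary:

from collections import deque

def close_over_class_tree(keep, class_edges):
    """Add every item reachable from the keep set via P31/P279 links.

    class_edges maps a QID to the QIDs of its instance-of (P31) and
    subclass-of (P279) targets, extracted from the dump beforehand, so
    kept items never lose their path into the class tree just because
    an intermediate class was filtered out.
    """
    closed = set(keep)
    queue = deque(keep)
    while queue:
        qid = queue.popleft()
        for target in class_edges.get(qid, ()):
            if target not in closed:
                closed.add(target)
                queue.append(target)
    return closed

# Toy example: Q1 is kept, its class Q10 was filtered out, and Q10's
# superclass Q100 is needed to keep the tree connected.
edges = {"Q1": ["Q10"], "Q10": ["Q100"]}
print(close_over_class_tree({"Q1"}, edges))  # {'Q1', 'Q10', 'Q100'}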


Other such items might be very valuable (or not valuable at all),
depending on the end user's application. So it's definitely non-trivial
to decide what the WD team should leave in for any generic dump intended
to have wide usefulness. Even for a specific end use, one might want to
think quite carefully about what not to cut out.


  -- James.



On 18/12/2019 12:37, Edgard Marx wrote:

It certainly helps; however, I think Aidan's suggestion goes in the
direction of having an official dump distribution.

Imagine how much CO2 can be spared just by avoiding the computational
resources needed to recreate this dump every time someone needs it.

Besides, it standardises the dataset used for research purposes.

On Wed, Dec 18, 2019, 11:26 Marco Fossati  wrote:


Hi everyone,

Benno (in CC) has recently announced this tool:
https://tools.wmflabs.org/wdumps/

I haven't checked it out yet, but it sounds related to Aidan's inquiry.
Hope this helps.

Cheers,

Marco

On 12/18/19 8:01 AM, Edgard Marx wrote:

+1

On Tue, Dec 17, 2019, 19:14 Aidan Hogan <aid...@gmail.com> wrote:

 Hey all,

 As someone who likes to use Wikidata in their research, and likes to
 give students projects relating to Wikidata, I am finding it more and
 more difficult to (recommend to) work with recent versions of Wikidata
 due to the increasing dump sizes, where even the truthy version now
 costs considerable time and machine resources to process and handle. In
 some cases we just grin and bear the costs, while in other cases we
 apply an ad hoc sampling to be able to play around with the data and
 try
 things quickly.

 More generally, I think the growing data volumes might inadvertently
 scare people off taking the dumps and using them in their research.

 One idea we had recently to reduce the data size for a student project
 while keeping the most notable parts of Wikidata was to only keep
 claims
 that involve an item linked to Wikipedia; in other words, if the
 statement involves a Q item (in the "subject" or "object") not
 linked to
 Wikipedia, the statement is removed.

 I wonder would it be possible for Wikidata to provide such a dump to
 download (e.g., in RDF) for people who prefer to work with a more
 concise sub-graph that still maintains the most "notable" parts? While
 of course one could compute this from the full-dump locally, making
 such
 a version available as a dump directly would save clients some
 resources, potentially encourage more research using/on Wikidata, and
 having such a version "rubber-stamped" by Wikidata would also help to
 justify the use of such a dataset for research purposes.

 ... just an idea I thought I would float out there. Perhaps there is
 another (better) way to define a concise dump.

 Best,
 Aidan

 ___
 Wikidata mailing list
 Wikidata@lists.wikimedia.org 
 https://lists.wikimedia.org/mailman/listinfo/wikidata


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata





___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Concise/Notable Wikidata Dump

2019-12-18 Thread Edgard Marx
It certainly helps; however, I think Aidan's suggestion goes in the
direction of having an official dump distribution.

Imagine how much CO2 can be spared just by avoiding the computational
resources needed to recreate this dump every time someone needs it.

Besides, it standardises the dataset used for research purposes.

On Wed, Dec 18, 2019, 11:26 Marco Fossati  wrote:

> Hi everyone,
>
> Benno (in CC) has recently announced this tool:
> https://tools.wmflabs.org/wdumps/
>
> I haven't checked it out yet, but it sounds related to Aidan's inquiry.
> Hope this helps.
>
> Cheers,
>
> Marco
>
> On 12/18/19 8:01 AM, Edgard Marx wrote:
> > +1
> >
> > On Tue, Dec 17, 2019, 19:14 Aidan Hogan wrote:
> >
> > Hey all,
> >
> > As someone who likes to use Wikidata in their research, and likes to
> > give students projects relating to Wikidata, I am finding it more and
> > more difficult to (recommend to) work with recent versions of
> Wikidata
> > due to the increasing dump sizes, where even the truthy version now
> > costs considerable time and machine resources to process and handle.
> In
> > some cases we just grin and bear the costs, while in other cases we
> > apply an ad hoc sampling to be able to play around with the data and
> > try
> > things quickly.
> >
> > More generally, I think the growing data volumes might inadvertently
> > scare people off taking the dumps and using them in their research.
> >
> > One idea we had recently to reduce the data size for a student
> project
> > while keeping the most notable parts of Wikidata was to only keep
> > claims
> > that involve an item linked to Wikipedia; in other words, if the
> > statement involves a Q item (in the "subject" or "object") not
> > linked to
> > Wikipedia, the statement is removed.
> >
> > I wonder would it be possible for Wikidata to provide such a dump to
> > download (e.g., in RDF) for people who prefer to work with a more
> > concise sub-graph that still maintains the most "notable" parts?
> While
> > of course one could compute this from the full-dump locally, making
> > such
> > a version available as a dump directly would save clients some
> > resources, potentially encourage more research using/on Wikidata, and
> > having such a version "rubber-stamped" by Wikidata would also help to
> > justify the use of such a dataset for research purposes.
> >
> > ... just an idea I thought I would float out there. Perhaps there is
> > another (better) way to define a concise dump.
> >
> > Best,
> > Aidan
> >
> > ___
> > Wikidata mailing list
> > Wikidata@lists.wikimedia.org 
> > https://lists.wikimedia.org/mailman/listinfo/wikidata
> >
> >
> > ___
> > Wikidata mailing list
> > Wikidata@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikidata
> >
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Concise/Notable Wikidata Dump

2019-12-18 Thread Marco Fossati

Hi everyone,

Benno (in CC) has recently announced this tool:
https://tools.wmflabs.org/wdumps/

I haven't checked it out yet, but it sounds related to Aidan's inquiry.
Hope this helps.

Cheers,

Marco

On 12/18/19 8:01 AM, Edgard Marx wrote:

+1

On Tue, Dec 17, 2019, 19:14 Aidan Hogan wrote:


Hey all,

As someone who likes to use Wikidata in their research, and likes to
give students projects relating to Wikidata, I am finding it more and
more difficult to (recommend to) work with recent versions of Wikidata
due to the increasing dump sizes, where even the truthy version now
costs considerable time and machine resources to process and handle. In
some cases we just grin and bear the costs, while in other cases we
apply an ad hoc sampling to be able to play around with the data and
try
things quickly.

More generally, I think the growing data volumes might inadvertently
scare people off taking the dumps and using them in their research.

One idea we had recently to reduce the data size for a student project
while keeping the most notable parts of Wikidata was to only keep
claims
that involve an item linked to Wikipedia; in other words, if the
statement involves a Q item (in the "subject" or "object") not
linked to
Wikipedia, the statement is removed.

I wonder would it be possible for Wikidata to provide such a dump to
download (e.g., in RDF) for people who prefer to work with a more
concise sub-graph that still maintains the most "notable" parts? While
of course one could compute this from the full-dump locally, making
such
a version available as a dump directly would save clients some
resources, potentially encourage more research using/on Wikidata, and
having such a version "rubber-stamped" by Wikidata would also help to
justify the use of such a dataset for research purposes.

... just an idea I thought I would float out there. Perhaps there is
another (better) way to define a concise dump.

Best,
Aidan

___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Concise/Notable Wikidata Dump

2019-12-17 Thread Edgard Marx
+1

On Tue, Dec 17, 2019, 19:14 Aidan Hogan  wrote:

> Hey all,
>
> As someone who likes to use Wikidata in their research, and likes to
> give students projects relating to Wikidata, I am finding it more and
> more difficult to (recommend to) work with recent versions of Wikidata
> due to the increasing dump sizes, where even the truthy version now
> costs considerable time and machine resources to process and handle. In
> some cases we just grin and bear the costs, while in other cases we
> apply an ad hoc sampling to be able to play around with the data and try
> things quickly.
>
> More generally, I think the growing data volumes might inadvertently
> scare people off taking the dumps and using them in their research.
>
> One idea we had recently to reduce the data size for a student project
> while keeping the most notable parts of Wikidata was to only keep claims
> that involve an item linked to Wikipedia; in other words, if the
> statement involves a Q item (in the "subject" or "object") not linked to
> Wikipedia, the statement is removed.
>
> I wonder would it be possible for Wikidata to provide such a dump to
> download (e.g., in RDF) for people who prefer to work with a more
> concise sub-graph that still maintains the most "notable" parts? While
> of course one could compute this from the full-dump locally, making such
> a version available as a dump directly would save clients some
> resources, potentially encourage more research using/on Wikidata, and
> having such a version "rubber-stamped" by Wikidata would also help to
> justify the use of such a dataset for research purposes.
>
> ... just an idea I thought I would float out there. Perhaps there is
> another (better) way to define a concise dump.
>
> Best,
> Aidan
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata