Re: [Wikidata] Wikidata HDT dump

2018-10-02 Thread Laura Morales
> You shouldn't have to keep anything in RAM to HDT-ize something as you could 
> make the dictionary by sorting on disk and also do the joins to look up 
> everything against the dictionary by sorting.

Yes but somebody has to write the code for it :)
My understanding is that they keep everything in memory because it was simpler 
to develop. The problem is that graphs can become really huge so this approach 
clearly doesn't scale too well.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2018-10-02 Thread Laura Morales
> 100 GB "with an optimized code" could be enough to produce an HDT like that.

The current software definitely cannot handle wikidata with 100GB. It was tried 
before and it failed.
I'm glad to see that new code will be released to handle large files. After 
skimming that paper it looks like they split the RDF source into multiple files 
and "cat" them into a single HDT file. 100GB is still a pretty large footprint, 
but I'm so glad that they're working on this. A 128GB server is *way* more 
affordable than one with 512GB or 1TB!

I can't wait to try the new code myself.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2018-10-01 Thread Paul Houle
You shouldn't have to keep anything in RAM to HDT-ize something as you 
could make the dictionary by sorting on disk and also do the joins to 
look up everything against the dictionary by sorting.


-- Original Message --
From: "Ettore RIZZA" 
To: "Discussion list for the Wikidata project." 


Sent: 10/1/2018 5:03:59 PM
Subject: Re: [Wikidata] Wikidata HDT dump

> what computer did you use for this? IIRC it required >512GB of RAM to 
function.


Hello Laura,

Sorry for my confusing message, I am not at all a member of the HDT 
team. But according to its creator 
<https://twitter.com/ciutti/status/1046849607114936320>, 100 GB "with 
an optimized code" could be enough to produce an HDT like that.


On Mon, 1 Oct 2018 at 18:59, Laura Morales  wrote:
> a new dump of Wikidata in HDT (with index) is 
available[http://www.rdfhdt.org/datasets/].


Thank you very much! Keep it up!
Out of curiosity, what computer did you use for this? IIRC it required 
>512GB of RAM to function.


> You will see how Wikidata has become huge compared to other 
datasets. it contains about twice the limit of 4B triples discussed 
above.


There is a 64-bit version of HDT that doesn't have this limitation of 
4B triples.


> In this regard, what is in 2018 the most user friendly way to use 
this format?


Speaking for me at least, Fuseki with a HDT store. But I know there 
are also some CLI tools from the HDT folks.


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2018-10-01 Thread Ettore RIZZA
> what computer did you use for this? IIRC it required >512GB of RAM to
function.

Hello Laura,

Sorry for my confusing message, I am not at all a member of the HDT team.
But according to its creator
, 100 GB "with an
optimized code" could be enough to produce an HDT like that.

On Mon, 1 Oct 2018 at 18:59, Laura Morales  wrote:

> > a new dump of Wikidata in HDT (with index) is available[
> http://www.rdfhdt.org/datasets/].
>
> Thank you very much! Keep it up!
> Out of curiosity, what computer did you use for this? IIRC it required
> >512GB of RAM to function.
>
> > You will see how Wikidata has become huge compared to other datasets. it
> contains about twice the limit of 4B triples discussed above.
>
> There is a 64-bit version of HDT that doesn't have this limitation of 4B
> triples.
>
> > In this regard, what is in 2018 the most user friendly way to use this
> format?
>
> Speaking for me at least, Fuseki with a HDT store. But I know there are
> also some CLI tools from the HDT folks.
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2018-10-01 Thread Laura Morales
> a new dump of Wikidata in HDT (with index) is 
> available[http://www.rdfhdt.org/datasets/].

Thank you very much! Keep it up!
Out of curiosity, what computer did you use for this? IIRC it required >512GB 
of RAM to function.

> You will see how Wikidata has become huge compared to other datasets. it 
> contains about twice the limit of 4B triples discussed above.

There is a 64-bit version of HDT that doesn't have this limitation of 4B 
triples.

> In this regard, what is in 2018 the most user friendly way to use this format?

Speaking for me at least, Fuseki with a HDT store. But I know there are also 
some CLI tools from the HDT folks.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2018-10-01 Thread Ettore RIZZA
Hello,

a new dump of Wikidata in HDT (with index) is available
<http://www.rdfhdt.org/datasets/>. You will see how Wikidata has become
huge compared to other datasets. it contains about twice the limit of 4B
triples discussed above.

In this regard, what is in 2018 the most user friendly way to use this
format?

BR,

Ettore

On Tue, 7 Nov 2017 at 15:33, Ghislain ATEMEZING <
ghislain.atemez...@gmail.com> wrote:

> Hi Jeremie,
>
> Thanks for this info.
>
> In the meantime, what about making chunks of 3.5Bio triples (or any size
> less than 4Bio) and a script to convert the dataset? Would that be possible
> ?
>
>
>
> Best,
>
> Ghislain
>
>
>
> Provenance : Courrier <https://go.microsoft.com/fwlink/?LinkId=550986>
> pour Windows 10
>
>
>
> *De : *Jérémie Roquet 
> *Envoyé le :*mardi 7 novembre 2017 15:25
> *À : *Discussion list for the Wikidata project.
> 
> *Objet :*Re: [Wikidata] Wikidata HDT dump
>
>
>
> Hi everyone,
>
>
>
> I'm afraid the current implementation of HDT is not ready to handle
>
> more than 4 billions triples as it is limited to 32 bit indexes. I've
>
> opened an issue upstream: https://github.com/rdfhdt/hdt-cpp/issues/135
>
>
>
> Until this is addressed, don't waste your time trying to convert the
>
> entire Wikidata to HDT: it can't work.
>
>
>
> --
>
> Jérémie
>
>
>
> ___
>
> Wikidata mailing list
>
> Wikidata@lists.wikimedia.org
>
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-12-15 Thread Stas Malyshev
Hi!

> Somebody pointed me to the following issue:
> https://phabricator.wikimedia.org/T179681  Unfortunately I'm not able
> to log in there with the "Phabricator" so I cannot edit the issue
> directly.  I'm sending this email instead.

Thank you, I've updated the task with references to your comments.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-12-15 Thread Wouter Beek
Hi Wikidata community,

Somebody pointed me to the following issue:
https://phabricator.wikimedia.org/T179681  Unfortunately I'm not able
to log in there with the "Phabricator" so I cannot edit the issue
directly.  I'm sending this email instead.

The issue seems to be stalled because it is not possible to create HDT
files that contain more than 2B triples.  However, this is possible in
a specific 64 bit branch, which is how I created the downloadable
version I've sent a few days ago.  As indicated, I can create these
files for the community if there is a use case.

---
Cheers,
Wouter.

Email: wou...@triply.cc
WWW: http://triply.cc
Tel: +31647674624


On Tue, Dec 12, 2017 at 11:24 AM, Wouter Beek  wrote:
> Hi list,
>
> I'm sorry, I was under the impression that I had already shared this
> resource with you earlier, but I haven't...
>
> On 7 Nov I created an HDT file based on the then current download link
> from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
>
> You can download this HDT file and it's index from the following locations:
>   - http://lod-a-lot.lod.labs.vu.nl/data/wikidata.hdt (~45GB)
>   - http://lod-a-lot.lod.labs.vu.nl/data/wikidata.hdt.index.v1-1 (~28GB)
>
> You may need to compile with 64bit support, because there are more
> than 2B triples (https://github.com/rdfhdt/hdt-cpp/tree/develop-64).
> (To be exact, there are 4,579,973,187 triples in this file.)
>
> PS: If this resource turns out to be useful to the community we can
> offer an updated HDT file at a to be determined interval.
>
> ---
> Cheers,
> Wouter Beek.
>
> Email: wou...@triply.cc
> WWW: http://triply.cc
> Tel: +31647674624
>
> On Tue, Nov 7, 2017 at 6:31 PM, Laura Morales  wrote:
>>> drops `a wikibase:Item` and `a wikibase:Statement` types
>>
>> off topic but... why drop `a wikibase:Item`? Without this it seems 
>> impossible to retrieve a list of items.
>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-12-12 Thread Laura Morales
* T H A N KY O U *

> On 7 Nov I created an HDT file based on the then current download link
> from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz

Thank you very very much Wouter!! This is great!
Out of curiosity, could you please share some info about the machine that 
you've used to generate these files? In particular I mean hardware info, such 
as the model names of mobo/cpu/ram/disks. Also "how long" it took to generate 
these files would be an interesting information.

> PS: If this resource turns out to be useful to the community we can
> offer an updated HDT file at a to be determined interval.

This would be fantastic! Wikidata dumps about once a week, so I think even a 
new HDT file every 1-2 months would be awesome.
Related to this however... why not use the Laundromat for this? There are 
several datasets that are very large, and rdf2hdt is really expensive to run. 
Maybe you could schedule regular jobs for several graphs (wikidata, dbpedia, 
wordnet, linkedgeodata, government data, ...) and make them available at the 
Laundromat?

* T H A N KY O U *

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-12-12 Thread Wouter Beek
Hi list,

I'm sorry, I was under the impression that I had already shared this
resource with you earlier, but I haven't...

On 7 Nov I created an HDT file based on the then current download link
from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz

You can download this HDT file and it's index from the following locations:
  - http://lod-a-lot.lod.labs.vu.nl/data/wikidata.hdt (~45GB)
  - http://lod-a-lot.lod.labs.vu.nl/data/wikidata.hdt.index.v1-1 (~28GB)

You may need to compile with 64bit support, because there are more
than 2B triples (https://github.com/rdfhdt/hdt-cpp/tree/develop-64).
(To be exact, there are 4,579,973,187 triples in this file.)

PS: If this resource turns out to be useful to the community we can
offer an updated HDT file at a to be determined interval.

---
Cheers,
Wouter Beek.

Email: wou...@triply.cc
WWW: http://triply.cc
Tel: +31647674624

On Tue, Nov 7, 2017 at 6:31 PM, Laura Morales  wrote:
>> drops `a wikibase:Item` and `a wikibase:Statement` types
>
> off topic but... why drop `a wikibase:Item`? Without this it seems impossible 
> to retrieve a list of items.
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-07 Thread Laura Morales
> drops `a wikibase:Item` and `a wikibase:Statement` types

off topic but... why drop `a wikibase:Item`? Without this it seems impossible 
to retrieve a list of items.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-07 Thread Jérémie Roquet
2017-11-07 15:32 GMT+01:00 Ghislain ATEMEZING :
> In the meantime, what about making chunks of 3.5Bio triples (or any size
> less than 4Bio) and a script to convert the dataset? Would that be possible?

That seems possible to me, but I wonder if cutting the dataset in
independent clusters is not a bigger undertaking compared to making
HDT handle bigger datasets (I'm not saying it is, I've really no
idea).

Best regards,

-- 
Jérémie

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-07 Thread Jérémie Roquet
2017-11-07 17:09 GMT+01:00 Laura Morales :
> How many triples does wikidata have? The old dump from rdfhdt seem to have 
> about 2 billion, which means wikidata doubled the number of triples in less 
> than a year?

A naive grep | wc -l on the last turtle dump gives me an estimate of
4.65 billions triples.

Looking at https://tools.wmflabs.org/wikidata-todo/stats.php it seems
that Wikidata is indeed more than twice as big as only six months ago.

-- 
Jérémie

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-07 Thread Lucas Werkmeister
The Wikidata Query Service currently holds some 3.8 billion triples –
you can see the numbers on Grafana [1]. But WDQS “munges” the dump
before importing it – for instance, it merges wdata:… into wd:… and
drops `a wikibase:Item` and `a wikibase:Statement` types; see [2] for
details – so the triple count in the un-munged dump will be somewhat
larger than the triple count in WDQS.

Cheers,
Lucas

[1]:
https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?panelId=7
[2]:
https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#WDQS_data_differences


On 07.11.2017 17:09, Laura Morales wrote:
> How many triples does wikidata have? The old dump from rdfhdt seem to have 
> about 2 billion, which means wikidata doubled the number of triples in less 
> than a year?
>  
>  
>
> Sent: Tuesday, November 07, 2017 at 3:24 PM
> From: "Jérémie Roquet" <jroq...@arkanosis.net>
> To: "Discussion list for the Wikidata project." <wikidata@lists.wikimedia.org>
> Subject: Re: [Wikidata] Wikidata HDT dump
> Hi everyone,
>
> I'm afraid the current implementation of HDT is not ready to handle
> more than 4 billions triples as it is limited to 32 bit indexes. I've
> opened an issue upstream: https://github.com/rdfhdt/hdt-cpp/issues/135
>
> Until this is addressed, don't waste your time trying to convert the
> entire Wikidata to HDT: it can't work.
>
> --
> Jérémie
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata[https://lists.wikimedia.org/mailman/listinfo/wikidata]
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-07 Thread Laura Morales
How many triples does wikidata have? The old dump from rdfhdt seem to have 
about 2 billion, which means wikidata doubled the number of triples in less 
than a year?
 
 

Sent: Tuesday, November 07, 2017 at 3:24 PM
From: "Jérémie Roquet" <jroq...@arkanosis.net>
To: "Discussion list for the Wikidata project." <wikidata@lists.wikimedia.org>
Subject: Re: [Wikidata] Wikidata HDT dump
Hi everyone,

I'm afraid the current implementation of HDT is not ready to handle
more than 4 billions triples as it is limited to 32 bit indexes. I've
opened an issue upstream: https://github.com/rdfhdt/hdt-cpp/issues/135

Until this is addressed, don't waste your time trying to convert the
entire Wikidata to HDT: it can't work.

--
Jérémie

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata[https://lists.wikimedia.org/mailman/listinfo/wikidata]

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-07 Thread Jérémie Roquet
Hi everyone,

I'm afraid the current implementation of HDT is not ready to handle
more than 4 billions triples as it is limited to 32 bit indexes. I've
opened an issue upstream: https://github.com/rdfhdt/hdt-cpp/issues/135

Until this is addressed, don't waste your time trying to convert the
entire Wikidata to HDT: it can't work.

-- 
Jérémie

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-03 Thread Laura Morales
> I’ve created a Phabricator task (https://phabricator.wikimedia.org/T179681) 
> for providing a HDT dump, let’s see if someone else (ideally from the ops 
> team) responds to it. (I’m not familiar with the systems we currently use for 
> the dumps, so I can’t say if they have enough resources for this.)

Thank you Lucas!

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-03 Thread Lucas Werkmeister
I’ve created a Phabricator task (https://phabricator.wikimedia.org/T179681)
for providing a HDT dump, let’s see if someone else (ideally from the ops
team) responds to it. (I’m not familiar with the systems we currently use
for the dumps, so I can’t say if they have enough resources for this.)

Cheers,
Lucas

2017-11-03 10:29 GMT+01:00 Laura Morales :

> > Thank you for this feedback, Laura.
> > Is the hdt index you got available somewhere on the cloud?
>
> Unfortunately it's not. It was a private link that was temporarily shared
> with me by email. I guess I could re-upload the file somewhere else myself,
> but my uplink is really slow (1Mbps).
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>



-- 
Lucas Werkmeister
Software Developer (Intern)

Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0)30 219 158 26-0
https://wikimedia.de

Imagine a world, in which every single human being can freely share in the
sum of all knowledge. That‘s our commitment.

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
Körperschaften I Berlin, Steuernummer 27/029/42207.
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-03 Thread Laura Morales
> Thank you for this feedback, Laura. 
> Is the hdt index you got available somewhere on the cloud?

Unfortunately it's not. It was a private link that was temporarily shared with 
me by email. I guess I could re-upload the file somewhere else myself, but my 
uplink is really slow (1Mbps).

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-03 Thread Ettore RIZZA
Thank you very much, Jasper !

2017-11-03 10:15 GMT+01:00 Jasper Koehorst :

> I am uploading the index file temporarily to:
>
> http://fungen.wur.nl/~jasperk/WikiData/
>
> Jasper
>
>
> On 3 Nov 2017, at 10:05, Ettore RIZZA  wrote:
>
> Thank you for this feedback, Laura.
>
> Is the hdt index you got available somewhere on the cloud?
>
> Cheers
>
> 2017-11-03 9:56 GMT+01:00 Osma Suominen :
>
>> Hi Laura,
>>
>> Thank you for sharing your experience! I think your example really shows
>> the power - and limitations - of HDT technology for querying very large RDF
>> data sets. While I don't currently have any use case for a local, queryable
>> Wikidata dump, I can easily see that it could be very useful for doing e.g.
>> resource-intensive, analytic queries. Having access to a recent hdt+index
>> dump of Wikidata would make it very easy to start doing that. So I second
>> your plea.
>>
>> -Osma
>>
>>
>> Laura Morales kirjoitti 03.11.2017 klo 09:48:
>>
>>> Hello list,
>>>
>>> a very kind person from this list has generated the .hdt.index file for
>>> me, using the 1-year old wikidata HDT file available at the rdfhdt website.
>>> So I was finally able to setup a working local endpoint using HDT+Fuseki.
>>> Set up was easy, launch time (for Fuseki) also was quick (a few seconds),
>>> the only change I made was to replace -Xmx1024m to -Xmx4g in the Fuseki
>>> startup script (btw I'm not very proficient in Java, so I hope this is the
>>> correct way). I've ran some queries too. Simple select or traversal queries
>>> seems fast to me (I haven't measured them but the response is almost
>>> immediate), other queries such as "select distinct ?class where { [] a
>>> ?class }" takes several seconds or a few minutes to complete, which kinda
>>> tells me the HDT indexes don't work well on all queries. But otherwise for
>>> simple queries it works perfectly! At least I'm able to query the dataset!
>>> In conclusion, I think this more or less gives some positive feedback
>>> for using HDT on a "commodity computer", which means it can be very useful
>>> for people like me who want to use the dataset locally but who can't setup
>>> a full-blown server. If others want to try as well, they can offer more
>>> (hopefully positive) feedback.
>>> For all of this, I heartwarmingly plea any wikidata dev to please
>>> consider scheduling a HDT dump (.hdt + .hdt.index) along with the other
>>> regular dumps that it creates weekly.
>>>
>>> Thank you!!
>>>
>>> ___
>>> Wikidata mailing list
>>> Wikidata@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>>
>>>
>>
>> --
>> Osma Suominen
>> D.Sc. (Tech), Information Systems Specialist
>> National Library of Finlan
>> d
>> P.O. Box 26 (Kaikukatu 4)
>> 00014 HELSINGIN YLIOPISTO
>> Tel. +358 50 3199529
>> osma.suomi...@helsinki.fi
>> http://www.nationallibrary.fi
>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-03 Thread Jasper Koehorst
I am uploading the index file temporarily to:

http://fungen.wur.nl/~jasperk/WikiData/ 


Jasper


> On 3 Nov 2017, at 10:05, Ettore RIZZA  wrote:
> 
> Thank you for this feedback, Laura. 
> 
> Is the hdt index you got available somewhere on the cloud?
> 
> Cheers
> 
> 
> 2017-11-03 9:56 GMT+01:00 Osma Suominen  >:
> Hi Laura,
> 
> Thank you for sharing your experience! I think your example really shows the 
> power - and limitations - of HDT technology for querying very large RDF data 
> sets. While I don't currently have any use case for a local, queryable 
> Wikidata dump, I can easily see that it could be very useful for doing e.g. 
> resource-intensive, analytic queries. Having access to a recent hdt+index 
> dump of Wikidata would make it very easy to start doing that. So I second 
> your plea.
> 
> -Osma
> 
> 
> Laura Morales kirjoitti 03.11.2017 klo 09:48:
> Hello list,
> 
> a very kind person from this list has generated the .hdt.index file for me, 
> using the 1-year old wikidata HDT file available at the rdfhdt website. So I 
> was finally able to setup a working local endpoint using HDT+Fuseki. Set up 
> was easy, launch time (for Fuseki) also was quick (a few seconds), the only 
> change I made was to replace -Xmx1024m to -Xmx4g in the Fuseki startup script 
> (btw I'm not very proficient in Java, so I hope this is the correct way). 
> I've ran some queries too. Simple select or traversal queries seems fast to 
> me (I haven't measured them but the response is almost immediate), other 
> queries such as "select distinct ?class where { [] a ?class }" takes several 
> seconds or a few minutes to complete, which kinda tells me the HDT indexes 
> don't work well on all queries. But otherwise for simple queries it works 
> perfectly! At least I'm able to query the dataset!
> In conclusion, I think this more or less gives some positive feedback for 
> using HDT on a "commodity computer", which means it can be very useful for 
> people like me who want to use the dataset locally but who can't setup a 
> full-blown server. If others want to try as well, they can offer more 
> (hopefully positive) feedback.
> For all of this, I heartwarmingly plea any wikidata dev to please consider 
> scheduling a HDT dump (.hdt + .hdt.index) along with the other regular dumps 
> that it creates weekly.
> 
> Thank you!!
> 
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/wikidata 
> 
> 
> 
> 
> -- 
> Osma Suominen
> D.Sc. (Tech), Information Systems Specialist
> National Library of Finlan 
> d
> P.O. Box 26 (Kaikukatu 4)
> 00014 HELSINGIN YLIOPISTO
> Tel. +358 50 3199529 
> osma.suomi...@helsinki.fi 
> http://www.nationallibrary.fi 
> 
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/wikidata 
> 
> 
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-03 Thread Ettore RIZZA
Thank you for this feedback, Laura.

Is the hdt index you got available somewhere on the cloud?

Cheers

2017-11-03 9:56 GMT+01:00 Osma Suominen :

> Hi Laura,
>
> Thank you for sharing your experience! I think your example really shows
> the power - and limitations - of HDT technology for querying very large RDF
> data sets. While I don't currently have any use case for a local, queryable
> Wikidata dump, I can easily see that it could be very useful for doing e.g.
> resource-intensive, analytic queries. Having access to a recent hdt+index
> dump of Wikidata would make it very easy to start doing that. So I second
> your plea.
>
> -Osma
>
>
> Laura Morales kirjoitti 03.11.2017 klo 09:48:
>
>> Hello list,
>>
>> a very kind person from this list has generated the .hdt.index file for
>> me, using the 1-year old wikidata HDT file available at the rdfhdt website.
>> So I was finally able to setup a working local endpoint using HDT+Fuseki.
>> Set up was easy, launch time (for Fuseki) also was quick (a few seconds),
>> the only change I made was to replace -Xmx1024m to -Xmx4g in the Fuseki
>> startup script (btw I'm not very proficient in Java, so I hope this is the
>> correct way). I've ran some queries too. Simple select or traversal queries
>> seems fast to me (I haven't measured them but the response is almost
>> immediate), other queries such as "select distinct ?class where { [] a
>> ?class }" takes several seconds or a few minutes to complete, which kinda
>> tells me the HDT indexes don't work well on all queries. But otherwise for
>> simple queries it works perfectly! At least I'm able to query the dataset!
>> In conclusion, I think this more or less gives some positive feedback for
>> using HDT on a "commodity computer", which means it can be very useful for
>> people like me who want to use the dataset locally but who can't setup a
>> full-blown server. If others want to try as well, they can offer more
>> (hopefully positive) feedback.
>> For all of this, I heartwarmingly plea any wikidata dev to please
>> consider scheduling a HDT dump (.hdt + .hdt.index) along with the other
>> regular dumps that it creates weekly.
>>
>> Thank you!!
>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>>
>
> --
> Osma Suominen
> D.Sc. (Tech), Information Systems Specialist
> National Library of Finlan
> d
> P.O. Box 26 (Kaikukatu 4)
> 00014 HELSINGIN YLIOPISTO
> Tel. +358 50 3199529
> osma.suomi...@helsinki.fi
> http://www.nationallibrary.fi
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-03 Thread Osma Suominen

Hi Laura,

Thank you for sharing your experience! I think your example really shows 
the power - and limitations - of HDT technology for querying very large 
RDF data sets. While I don't currently have any use case for a local, 
queryable Wikidata dump, I can easily see that it could be very useful 
for doing e.g. resource-intensive, analytic queries. Having access to a 
recent hdt+index dump of Wikidata would make it very easy to start doing 
that. So I second your plea.


-Osma

Laura Morales kirjoitti 03.11.2017 klo 09:48:

Hello list,

a very kind person from this list has generated the .hdt.index file for me, using the 
1-year old wikidata HDT file available at the rdfhdt website. So I was finally able to 
setup a working local endpoint using HDT+Fuseki. Set up was easy, launch time (for 
Fuseki) also was quick (a few seconds), the only change I made was to replace -Xmx1024m 
to -Xmx4g in the Fuseki startup script (btw I'm not very proficient in Java, so I hope 
this is the correct way). I've ran some queries too. Simple select or traversal queries 
seems fast to me (I haven't measured them but the response is almost immediate), other 
queries such as "select distinct ?class where { [] a ?class }" takes several 
seconds or a few minutes to complete, which kinda tells me the HDT indexes don't work 
well on all queries. But otherwise for simple queries it works perfectly! At least I'm 
able to query the dataset!
In conclusion, I think this more or less gives some positive feedback for using HDT on a 
"commodity computer", which means it can be very useful for people like me who 
want to use the dataset locally but who can't setup a full-blown server. If others want 
to try as well, they can offer more (hopefully positive) feedback.
For all of this, I heartwarmingly plea any wikidata dev to please consider 
scheduling a HDT dump (.hdt + .hdt.index) along with the other regular dumps 
that it creates weekly.

Thank you!!

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suomi...@helsinki.fi
http://www.nationallibrary.fi

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-03 Thread Laura Morales
Hello list,

a very kind person from this list has generated the .hdt.index file for me, 
using the 1-year old wikidata HDT file available at the rdfhdt website. So I 
was finally able to setup a working local endpoint using HDT+Fuseki. Set up was 
easy, launch time (for Fuseki) also was quick (a few seconds), the only change 
I made was to replace -Xmx1024m to -Xmx4g in the Fuseki startup script (btw I'm 
not very proficient in Java, so I hope this is the correct way). I've ran some 
queries too. Simple select or traversal queries seems fast to me (I haven't 
measured them but the response is almost immediate), other queries such as 
"select distinct ?class where { [] a ?class }" takes several seconds or a few 
minutes to complete, which kinda tells me the HDT indexes don't work well on 
all queries. But otherwise for simple queries it works perfectly! At least I'm 
able to query the dataset!
In conclusion, I think this more or less gives some positive feedback for using 
HDT on a "commodity computer", which means it can be very useful for people 
like me who want to use the dataset locally but who can't setup a full-blown 
server. If others want to try as well, they can offer more (hopefully positive) 
feedback.
For all of this, I heartwarmingly plea any wikidata dev to please consider 
scheduling a HDT dump (.hdt + .hdt.index) along with the other regular dumps 
that it creates weekly.

Thank you!!

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-02 Thread Osma Suominen

Laura Morales kirjoitti 02.11.2017 klo 15:54:

The tool is in the hdt-jena package (not hdt-java-cli where the other

command line tools reside), since it uses parts of Jena (e.g. ARQ).

There is a wrapper script called hdtsparql.sh for executing it with the

proper Java environment.

Does this tool work nicely with large HDT files such as wikidata? Or does it 
need to load the whole graph+index into memory?


I haven't tested it with huge datasets like Wikidata. But for the 
moderately sized (40M triples) data that I use it for, it runs pretty 
fast and without using lots of memory, so I think it just memory maps 
the hdt and index file and reads only what it needs to answer the query.


-Osma


--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suomi...@helsinki.fi
http://www.nationallibrary.fi

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-02 Thread Laura Morales
> There is also a command line tool called hdtsparql in the hdt-java
distribution that allows exactly this. It used to support only SELECT
queries, but I've enhanced it to support CONSTRUCT, DESCRIBE and ASK
queries too. There are some limitations, for example only CSV output is
supported for SELECT and N-Triples for CONSTRUCT and DESCRIBE.

Thank you for sharing.

> The tool is in the hdt-jena package (not hdt-java-cli where the other
command line tools reside), since it uses parts of Jena (e.g. ARQ).
> There is a wrapper script called hdtsparql.sh for executing it with the
proper Java environment.

Does this tool work nicely with large HDT files such as wikidata? Or does it 
need to load the whole graph+index into memory?

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-02 Thread Osma Suominen

Laura Morales kirjoitti 01.11.2017 klo 08:59:

For querying I use the Jena query engine. I have created a module called 
HDTQuery located http://download.systemsbiology.nl/sapp/ which is a simple 
program and under development that should be able to use the full power of 
SPARQL and be more advanced than grep… ;)


Does this tool allow to query HDT files from command-line, with SPARQL, and 
without the need to setup a Fuseki endpoint?


There is also a command line tool called hdtsparql in the hdt-java 
distribution that allows exactly this. It used to support only SELECT 
queries, but I've enhanced it to support CONSTRUCT, DESCRIBE and ASK 
queries too. There are some limitations, for example only CSV output is 
supported for SELECT and N-Triples for CONSTRUCT and DESCRIBE. But it 
works fine, at least for my use cases, and is often more convenient than 
firing up Fuseki-HDT. It requires both the hdt file and the 
corresponding index file.


Code here:
https://github.com/rdfhdt/hdt-java/blob/master/hdt-jena/src/main/java/org/rdfhdt/hdtjena/cmd/HDTSparql.java

The tool is in the hdt-jena package (not hdt-java-cli where the other 
command line tools reside), since it uses parts of Jena (e.g. ARQ). 
There is a wrapper script called hdtsparql.sh for executing it with the 
proper Java environment.


Typical usage (example from hdt-java README):

# Execute SPARQL Query against the file.
$ ./hdtsparql.sh ../hdt-java/data/test.hdt "SELECT ?s ?p ?o WHERE { ?s 
?p ?o . }"


-Osma


--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suomi...@helsinki.fi
http://www.nationallibrary.fi

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-01 Thread Jasper Koehorst
Yes you just run it should get a sufficient help and if not… I am more than 
happy to polish the code… 

java -jar /Users/jasperkoehorst/Downloads/HDTQuery.jar 
The following option is required: -query 
Usage:  [options]
  Options:
--help

-debug
  Debug mode
  Default: false
-e
  SPARQL endpoint
-f
  Output format, csv / tsv
  Default: csv
-i
  HDT input file(s) for querying (comma separated)
-o
  Query result file
  * -query
  SPARQL Query or FILE containing the query to execute

  * required parameter


> On 1 Nov 2017, at 07:59, Laura Morales  wrote:
> 
>> I am currently downloading the latest ttl file. On a 250gig ram machine. I 
>> will see if that is sufficient to run the conversion Otherwise we have 
>> another busy one with  around 310 gig.
> 
> Thank you!
> 
>> For querying I use the Jena query engine. I have created a module called 
>> HDTQuery located http://download.systemsbiology.nl/sapp/ which is a simple 
>> program and under development that should be able to use the full power of 
>> SPARQL and be more advanced than grep… ;)
> 
> Does this tool allow to query HDT files from command-line, with SPARQL, and 
> without the need to setup a Fuseki endpoint?
> 
>> If this all works out I will see with our department if we can set up if it 
>> is still needed a weekly cron job to convert the TTL file. But as it is 
>> growing rapidly we might run into memory issues later?
> 
> Thank you!


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-01 Thread Laura Morales
> I am currently downloading the latest ttl file. On a 250gig ram machine. I 
> will see if that is sufficient to run the conversion Otherwise we have 
> another busy one with  around 310 gig.

Thank you!

> For querying I use the Jena query engine. I have created a module called 
> HDTQuery located http://download.systemsbiology.nl/sapp/ which is a simple 
> program and under development that should be able to use the full power of 
> SPARQL and be more advanced than grep… ;)

Does this tool allow to query HDT files from command-line, with SPARQL, and 
without the need to setup a Fuseki endpoint?

> If this all works out I will see with our department if we can set up if it 
> is still needed a weekly cron job to convert the TTL file. But as it is 
> growing rapidly we might run into memory issues later?

Thank you!

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-01 Thread Jasper Koehorst
We are actually planning to buy a new barebone server and they are around 
E2500,-. With barely any memory. I will check later with sales, 16gig ram 
strips are around max 200 euros so below 10K should be sufficient?


> On 1 Nov 2017, at 07:47, Laura Morales  wrote:
> 
>> It's a machine with 378 GiB of RAM and 64 threads running Scientific
>> Linux 7.2, that we use mainly for benchmarks.
>> 
>> Building the index was really all about memory because the CPUs have
>> actually a lower per-thread performance (2.30 GHz vs 3.5 GHz) compared
>> to those of my regular workstation, which was unable to build it.
> 
> 
> If your regular workstation was using more CPU, I guess it was because of 
> swapping. Thanks for the statistics, it means a "commodity" CPU could handle 
> this fine, the bottleneck is RAM. I wonder how expensive it is to buy a 
> machine like yours... it sounds like in the $30K-$50K range?
> 
> 
>> You're right. The limited query language of hdtSearch is closer to
>> grep than to SPARQL.
>> 
>> Thank you for pointing out Fuseki, I'll have a look at it.
> 
> 
> I think a SPARQL command-line tool could exist, but AFAICT it doesn't exist 
> (yet?). Anyway, I have already successfully setup Fuseki with a HDT backend, 
> although my HDT files are all small. Feel free to drop me an email if you 
> need any help setting up Fuseki.
> 
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-01 Thread Jasper Koehorst
Hello,

I am currently downloading the latest ttl file. On a 250gig ram machine. I will 
see if that is sufficient to run the conversion Otherwise we have another busy 
one with  around 310 gig.
For querying I use the Jena query engine. I have created a module called 
HDTQuery located http://download.systemsbiology.nl/sapp/ 
 which is a simple program and under 
development that should be able to use the full power of SPARQL and be more 
advanced than grep… ;)

If this all works out I will see with our department if we can set up if it is 
still needed a weekly cron job to convert the TTL file. But as it is growing 
rapidly we might run into memory issues later?





> On 1 Nov 2017, at 00:32, Stas Malyshev  wrote:
> 
> Hi!
> 
>> OK. I wonder though, if it would be possible to setup a regular HDT
>> dump alongside the already regular dumps. Looking at the dumps page,
>> https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a
>> new dump is generated once a week more or less. So if a HDT dump
>> could
> 
> True, the dumps run weekly. "More or less" situation can arise only if
> one of the dumps fail (either due to a bug or some sort of external
> force majeure).
> -- 
> Stas Malyshev
> smalys...@wikimedia.org
> 
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-11-01 Thread Laura Morales
> It's a machine with 378 GiB of RAM and 64 threads running Scientific
> Linux 7.2, that we use mainly for benchmarks.
> 
> Building the index was really all about memory because the CPUs have
> actually a lower per-thread performance (2.30 GHz vs 3.5 GHz) compared
> to those of my regular workstation, which was unable to build it.


If your regular workstation was using more CPU, I guess it was because of 
swapping. Thanks for the statistics, it means a "commodity" CPU could handle 
this fine, the bottleneck is RAM. I wonder how expensive it is to buy a machine 
like yours... it sounds like in the $30K-$50K range?


> You're right. The limited query language of hdtSearch is closer to
> grep than to SPARQL.
> 
> Thank you for pointing out Fuseki, I'll have a look at it.


I think a SPARQL command-line tool could exist, but AFAICT it doesn't exist 
(yet?). Anyway, I have already successfully setup Fuseki with a HDT backend, 
although my HDT files are all small. Feel free to drop me an email if you need 
any help setting up Fuseki.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread sushil dutt
Please take me out from these conversations.

On Wed, Nov 1, 2017 at 5:02 AM, Stas Malyshev 
wrote:

> Hi!
>
> > OK. I wonder though, if it would be possible to setup a regular HDT
> > dump alongside the already regular dumps. Looking at the dumps page,
> > https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a
> > new dump is generated once a week more or less. So if a HDT dump
> > could
>
> True, the dumps run weekly. "More or less" situation can arise only if
> one of the dumps fail (either due to a bug or some sort of external
> force majeure).
> --
> Stas Malyshev
> smalys...@wikimedia.org
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>



-- 
Regards,
Sushil Dutt
8800911840
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Stas Malyshev
Hi!

> OK. I wonder though, if it would be possible to setup a regular HDT
> dump alongside the already regular dumps. Looking at the dumps page,
> https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a
> new dump is generated once a week more or less. So if a HDT dump
> could

True, the dumps run weekly. "More or less" situation can arise only if
one of the dumps fail (either due to a bug or some sort of external
force majeure).
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Jérémie Roquet
2017-10-31 21:27 GMT+01:00 Laura Morales :
>> I've just loaded the provided hdt file on a big machine (32 GiB wasn't
> enough to build the index but ten times this is more than enough)
> Could you please share a bit about your setup? Do you have a machine with 
> 320GB of RAM?

It's a machine with 378 GiB of RAM and 64 threads running Scientific
Linux 7.2, that we use mainly for benchmarks.

Building the index was really all about memory because the CPUs have
actually a lower per-thread performance (2.30 GHz vs 3.5 GHz) compared
to those of my regular workstation, which was unable to build it.

> Could you please also try to convert wikidata.ttl to hdt using "rdf2hdt"? I'd 
> be interested to read your results on this too.

As I'm also looking for up-to-date results, so I plan do it with the
last turtle dump as soon as I have a time slot for it; I'll let you
know about the outcome.

>> I'll try to run a few queries to see how it behaves.
>
> I don't think there is a command-line tool to parse SPARQL queries, so you 
> probably have to setup a Fuseki endpoint which uses HDT as a data source.

You're right. The limited query language of hdtSearch is closer to
grep than to SPARQL.

Thank you for pointing out Fuseki, I'll have a look at it.

-- 
Jérémie

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Laura Morales
> I've just loaded the provided hdt file on a big machine (32 GiB wasn't
enough to build the index but ten times this is more than enough)


Could you please share a bit about your setup? Do you have a machine with 320GB 
of RAM?
Could you please also try to convert wikidata.ttl to hdt using "rdf2hdt"? I'd 
be interested to read your results on this too.
Thank you!


> I'll try to run a few queries to see how it behaves.


I don't think there is a command-line tool to parse SPARQL queries, so you 
probably have to setup a Fuseki endpoint which uses HDT as a data source.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Ghislain ATEMEZING
Hola,
Please don’t get me wrong and don’t give any interpretation based on my 
question. 
Since the beginning of this thread, I am also trying to push the use of HDT 
here. For example, I was the one contacting HDT gurus to fix the dataset error 
on Twitter and so on...

Sorry if Laura or any one thought I was giving “some lessons here “. I don’t 
have a super computer either nor a member of Wikidata team. Just a “data 
consumer” as many here ..

Best,
Ghislain 

Sent from my iPhone, may include typos

> Le 31 oct. 2017 à 20:44, Luigi Assom  a écrit :
> 
> Doh what's wrong with asking for supporting own user case "UC" ?
> 
> I think it is a totally legit question to ask, and that's why this thread 
> exists.
> 
> Also, I do support for possibility to help access to data that would be hard 
> to process from "common" hardware. Especially in the case of open data.
> They exists to allow someone take them and build them - amazing if can 
> prototype locally, right?
> 
> I don't like the use case where a data-scientist-or-IT show to the other 
> data-scientist-or-IT own work looking for emotional support or praise.
> I've seen that, not here, and I hope this attitude stays indeed out from 
> here..
> 
> I do like when the work of data-scientist-or-IT ignites someone else's 
> creativity - someone who is completely external - , to say: hey your work is 
> cool and I wanna use it for... my use case!
> That's how ideas go around and help other people build complexity over them, 
> without constructing not necessary borders.
> 
> About a local version of compressed, index RDF - I think that if was 
> available, more people yes probably would use it.
> 
> 
> 
>> On Tue, Oct 31, 2017 at 4:03 PM, Laura Morales  wrote:
>> I feel like you are misrepresenting my request, and possibly trying to 
>> offend me as well.
>> 
>> My "UC" as you call it, is simply that I would like to have a local copy of 
>> wikidata, and query it using SPARQL. Everything that I've tried so far 
>> doesn't seem to work on commodity hardware since the database is so large. 
>> But HDT could work. So I asked if a HDT dump could, please, be added to 
>> other dumps that are periodically generated by wikidata. I also told you 
>> already that *I AM* trying to use the 1 year old dump, but in order to use 
>> the HDT tools I'm told that I *MUST* generate some other index first which 
>> unfortunately I can't generate for the same reasons that I can convert the 
>> Turtle to HDT. So what I was trying to say is, that if wikidata were to add 
>> any HDT dump, this dump should contain both the .hdt file and .hdt.index in 
>> order to be useful. That's about it, and it's not just about me. Anybody who 
>> wants to have a local copy of wikidata could benefit from this, since 
>> setting up a .hdt file seems much easier than a Turtle dump. And I don't 
>> understand why you're trying to blame me for this?
>> 
>> If you are part of the wikidata dev team, I'd greatly appreciate a 
>> "can/can't" or "don't care" response rather than playing the 
>> passive-aggressive game that you displayed in your last email.
>> 
>> 
>> > Let me try to understand ...
>> > You are a "data consumer" with the following needs:
>> >   - Latest version of the data
>> >   - Quick access to the data
>> >   - You don't want to use the current ways to access the data by the 
>> > publisher (endpoint, ttl dumps, LDFragments)
>> >  However, you ask for a binary format (HDT), but you don't have enough 
>> > memory to set up your own environment/endpoint due to lack of memory.
>> > For that reason, you are asking the publisher to support both .hdt and 
>> > .hdt.index files.
>> >
>> > Do you think there are many users with your current UC?
>> 
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
> 
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Luigi Assom
Doh what's wrong with asking for supporting own user case "UC" ?

I think it is a totally legit question to ask, and that's why this thread
exists.

Also, I do support for possibility to help access to data that would be
hard to process from "common" hardware. Especially in the case of open data.
They exists to allow someone take them and build them - amazing if can
prototype locally, right?

I don't like the use case where a data-scientist-or-IT show to the other
data-scientist-or-IT own work looking for emotional support or praise.
I've seen that, not here, and I hope this attitude stays indeed out from
here..

I do like when the work of data-scientist-or-IT ignites someone else's
creativity - someone who is completely external - , to say: hey your work
is cool and I wanna use it for... my use case!
That's how ideas go around and help other people build complexity over
them, without constructing not necessary borders.

About a local version of compressed, index RDF - I think that if was
available, more people yes probably would use it.



On Tue, Oct 31, 2017 at 4:03 PM, Laura Morales  wrote:

> I feel like you are misrepresenting my request, and possibly trying to
> offend me as well.
>
> My "UC" as you call it, is simply that I would like to have a local copy
> of wikidata, and query it using SPARQL. Everything that I've tried so far
> doesn't seem to work on commodity hardware since the database is so large.
> But HDT could work. So I asked if a HDT dump could, please, be added to
> other dumps that are periodically generated by wikidata. I also told you
> already that *I AM* trying to use the 1 year old dump, but in order to use
> the HDT tools I'm told that I *MUST* generate some other index first which
> unfortunately I can't generate for the same reasons that I can convert the
> Turtle to HDT. So what I was trying to say is, that if wikidata were to add
> any HDT dump, this dump should contain both the .hdt file and .hdt.index in
> order to be useful. That's about it, and it's not just about me. Anybody
> who wants to have a local copy of wikidata could benefit from this, since
> setting up a .hdt file seems much easier than a Turtle dump. And I don't
> understand why you're trying to blame me for this?
>
> If you are part of the wikidata dev team, I'd greatly appreciate a
> "can/can't" or "don't care" response rather than playing the
> passive-aggressive game that you displayed in your last email.
>
>
> > Let me try to understand ...
> > You are a "data consumer" with the following needs:
> >   - Latest version of the data
> >   - Quick access to the data
> >   - You don't want to use the current ways to access the data by the
> publisher (endpoint, ttl dumps, LDFragments)
> >  However, you ask for a binary format (HDT), but you don't have enough
> memory to set up your own environment/endpoint due to lack of memory.
> > For that reason, you are asking the publisher to support both .hdt and
> .hdt.index files.
> >
> > Do you think there are many users with your current UC?
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Jérémie Roquet
2017-10-31 14:56 GMT+01:00 Laura Morales :
> 1. I have downloaded it and I'm trying to use it, but the HDT tools (eg. 
> query) require to build an index before I can use the HDT file. I've tried to 
> create the index, but I ran out of memory again (even though the index is 
> smaller than the .hdt file itself). So any Wikidata dump should contain both 
> the .hdt file and the .hdt.index file unless there is another way to generate 
> the index on commodity hardware

I've just loaded the provided hdt file on a big machine (32 GiB wasn't
enough to build the index but ten times this is more than enough), so
here are a few interesting metrics:
 - the index alone is ~14 GiB big uncompressed, ~9 GiB gzipped and
~6.5 GiB xzipped ;
 - once loaded in hdtSearch, Wikidata uses ~36 GiB of virtual memory ;
 - right after index generation, it includes ~16 GiB of anonymous
memory (with no memory pressure, that's ~26 GiB resident)…
 - …but after a reload, the index is memory mapped as well, so it only
includes ~400 MiB of anonymous memory (and a mere ~1.2 GiB resident).

Looks like a good candidate for commodity hardware, indeed. It loads
in less than one second on a 32 GiB machine. I'll try to run a few
queries to see how it behaves.

FWIW, my use case is very similar to yours, as I'd like to run queries
that are too long for the public SPARQL endpoint and can't dedicate a
powerful machine do this full time (Blazegraph runs fine with 32 GiB,
though — it just takes a while to index and updating is not as fast as
the changes happening on wikidata.org).

-- 
Jérémie

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Ghislain ATEMEZING
Interesting use case Laura! Your UC is rather "special" :)
Let me try to understand ...
You are a "data consumer" with the following needs:
  - Latest version of the data
  - Quick access to the data
  - You don't want to use the current ways to access the data by the
publisher (endpoint, ttl dumps, LDFragments)
 However, you ask for a binary format (HDT), but you don't have enough
memory to set up your own environment/endpoint due to lack of memory.
For that reason, you are asking the publisher to support both .hdt and
.hdt.index files.

Do you think there are many users with your current UC?


El mar., 31 oct. 2017 a las 14:56, Laura Morales ()
escribió:

> > @Laura: I suspect Wouter wants to know if he "ignores" the previous
> errors and proposes a rather incomplete dump (just for you) or waits for
> Stas' feedback.
>
>
> OK. I wonder though, if it would be possible to setup a regular HDT dump
> alongside the already regular dumps. Looking at the dumps page,
> https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a new
> dump is generated once a week more or less. So if a HDT dump could be added
> to the schedule, it should show up with the next dump and then so forth
> with the future dumps. Right now even the Turtle dump contains the bad
> triples, so adding a HDT file now would not introduce more inconsistencies.
> The problem will be fixed automatically with the future dumps once the
> Turtle is fixed (because the HDT is generated from the .ttl file anyway).
>
>
> > Btw why don't you use the oldest version in HDT website?
>
>
> 1. I have downloaded it and I'm trying to use it, but the HDT tools (eg.
> query) require to build an index before I can use the HDT file. I've tried
> to create the index, but I ran out of memory again (even though the index
> is smaller than the .hdt file itself). So any Wikidata dump should contain
> both the .hdt file and the .hdt.index file unless there is another way to
> generate the index on commodity hardware
>
> 2. because it's 1 year old :)
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
-- 
---
"Love all, trust a few, do wrong to none" (W. Shakespeare)
Web: http://atemezing.org
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Laura Morales
> @Laura: I suspect Wouter wants to know if he "ignores" the previous errors 
> and proposes a rather incomplete dump (just for you) or waits for Stas' 
> feedback.


OK. I wonder though, if it would be possible to setup a regular HDT dump 
alongside the already regular dumps. Looking at the dumps page, 
https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a new dump is 
generated once a week more or less. So if a HDT dump could be added to the 
schedule, it should show up with the next dump and then so forth with the 
future dumps. Right now even the Turtle dump contains the bad triples, so 
adding a HDT file now would not introduce more inconsistencies. The problem 
will be fixed automatically with the future dumps once the Turtle is fixed 
(because the HDT is generated from the .ttl file anyway).


> Btw why don't you use the oldest version in HDT website?


1. I have downloaded it and I'm trying to use it, but the HDT tools (eg. query) 
require to build an index before I can use the HDT file. I've tried to create 
the index, but I ran out of memory again (even though the index is smaller than 
the .hdt file itself). So any Wikidata dump should contain both the .hdt file 
and the .hdt.index file unless there is another way to generate the index on 
commodity hardware

2. because it's 1 year old :)

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Ghislain ATEMEZING
@Laura: I suspect Wouter wants to know if he "ignores" the previous errors
and proposes a rather incomplete dump (just for you) or waits for Stas'
feedback.
Btw why don't you use the oldest version in HDT website?

El mar., 31 oct. 2017 a las 7:53, Laura Morales ()
escribió:

> @Wouter
>
> > Thanks for the pointer!  I'm downloading from
> https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz now.
>
> Any luck so far?
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
-- 
---
"Love all, trust a few, do wrong to none" (W. Shakespeare)
Web: http://atemezing.org
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-31 Thread Laura Morales
@Wouter

> Thanks for the pointer!  I'm downloading from 
> https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz now.

Any luck so far?

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-28 Thread Edgard Marx
On Sat, Oct 28, 2017 at 2:31 PM, Laura Morales  wrote:

> > KBox is an alternative to other existing architectures for publishing KB
> such as SPARQL endpoints (e.g. LDFragments, Virtuoso), and Dump files.
> > I should add that you can do federated query with KBox as as easier as
> you can do with SPARQL endpoints.
>
>
> OK, but I still fail to see what is the value of this? What's the reason
> why I'd want to use it rather than just start a Fuseki endpoint, or use
> linked-fragments?
>

I agree that KBox is not indicated to all scenarios, rather, it fits to
users that query frequently a KG,
plus do not want to spend time downloading and indexing dump files.
KBox bridge this cumbersome task, plus, it shift query execution to the
client, so no scalability issues.
BTW, if you want to work with Javascript you can also simple start an local
endpoint:

https://github.com/AKSW/KBox/blob/master/README.md#starting-a-sparql-endpoint


>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-28 Thread Laura Morales
> KBox is an alternative to other existing architectures for publishing KB such 
> as SPARQL endpoints (e.g. LDFragments, Virtuoso), and Dump files.
> I should add that you can do federated query with KBox as as easier as you 
> can do with SPARQL endpoints.


OK, but I still fail to see what is the value of this? What's the reason why 
I'd want to use it rather than just start a Fuseki endpoint, or use 
linked-fragments?

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-28 Thread Edgard Marx
Hoi Laura,

Thnks for the opportunity to clarify it.
KBox is an alternative to other existing architectures for publishing KB
such as SPARQL endpoints (e.g. LDFragments, Virtuoso), and Dump files.
I should add that you can do federated query with KBox as as easier as you
can do with SPARQL endpoints.
Here an example:

https://github.com/AKSW/KBox#how-can-i-query-multi-bases

You can use KBox either on JAVA API or command prompt.

best,

http://emarx.org

On Sat, Oct 28, 2017 at 1:16 PM, Laura Morales  wrote:

> > No, the idea is that each organization will have its own KNS, so users
> can add the KNS that they want.
>
> How would this compare with a traditional SPARQL endpoint + "federated
> queries", or with "linked fragments"?
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-28 Thread Stas Malyshev
Hi!

> The first part of the Turtle data stream seems to contain syntax errors
> for some of the XSD decimal literals.  The first one appears on line 13,291:
> 
> ```text/turtle
> 
> 
> "1.0E-6"^^ .

I've added https://phabricator.wikimedia.org/T179228 to handle this.
geoPrecision is a float value and assigning decimal type to it is a
mistake. I'll review other properties to see if we don't have more of
this. Thanks for bringing it to my attention!


-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-28 Thread Laura Morales
> @Laura : you mean this list http://lov.okfn.org/lov.nq.gz ?
> I can download it !!
> 
> Which one ? Please send me the URL and I can fix it !!


Yes you can download it, but the nq file is broken. It doesn't validate because 
some URIs contains white spaces, and some triples have an empty subject (ie. 
<>).

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-28 Thread Laura Morales
> No, the idea is that each organization will have its own KNS, so users can 
> add the KNS that they want. 

How would this compare with a traditional SPARQL endpoint + "federated 
queries", or with "linked fragments"?

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-28 Thread Ghislain ATEMEZING
@Laura : you mean this list http://lov.okfn.org/lov.nq.gz ? 
I can download it !! 

Which one ? Please send me the URL and I can fix it !!

Best,
Ghislain

Provenance : Courrier pour Windows 10

De : Laura Morales
Envoyé le :samedi 28 octobre 2017 11:24
À : wikidata@lists.wikimedia.org
Cc : Discussion list for the Wikidata project.
Objet :Re: [Wikidata] Wikidata HDT dump

> Thanks to report that. I remember one issue that I added here 
> https://github.com/pyvandenbussche/lov/issues/66


Yup, still broken! I've tried just now.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-28 Thread Edgard Marx
Hoi Ghislain,

On Sat, Oct 28, 2017 at 9:54 AM, Ghislain ATEMEZING <
ghislain.atemez...@gmail.com> wrote:

> Hello emarx,
> Many thanks for sharing KBox. Very interesting project!
>

thanks


> One question, how do you deal with different versions of the KB, like the
> case here of wikidata dump?
>

KBox works with the so called KNS (Knowledge Name Service) servers, so any
dataset publisher can have his own KNS.
Each dataset has its own KN (Knowledge Name) that is distributed over the
KNS (Knowledge Name Service).
E.g. wikidata dump is https://www.wikidata.org/20160801.


> Do you fetch their repo every xx time?
>

No, the idea is that each organization will have its own KNS, so users can
add the KNS that they want.
Currently all datasets available in KBox KNS are served by KBox team.
You can check all of them here kbox.tech, or using the command line (
https://github.com/AKSW/KBox#how-can-i-list-available-knowledge-bases).


> Also, for avoiding your users to re-create the models, you can pre-load
> "models" from LOV catalog.
>

We plan to share all LOD datasets in KBox, we are currently in discussing
this with W3C,
DBpedia might have its own KNS soon.
Regarding LOV catalog, you can help by just asking them to publish their
catalog in KBox.

best,

http://emarx.org


>
> Cheers,
> Ghislain
>
> 2017-10-27 21:56 GMT+02:00 Edgard Marx :
>
>> Hey guys,
>>
>> I don't know if you already knew about it,
>> but you can use KBox for Wikidata, DBpedia, Freebase, Lodstats...
>>
>> https://github.com/AKSW/KBox
>>
>> And yes, you can also use it to merge your graph with one of those
>>
>> https://github.com/AKSW/KBox#how-can-i-query-multi-bases
>>
>>  cheers,
>> 
>>
>>
>>
>> On Oct 27, 2017 21:02, "Jasper Koehorst" 
>> wrote:
>>
>> I will look into the size of the jnl file but should that not be located
>> where the blazegraph is running from the sparql endpoint or is this a
>> special flavour?
>> Was also thinking of looking into a gitlab runner which occasionally
>> could generate a HDT file from the ttl dump if our server can handle it but
>> for this an md5 sum file would be preferable or should a timestamp be
>> sufficient?
>>
>> Jasper
>>
>>
>> > On 27 Oct 2017, at 18:58, Jérémie Roquet  wrote:
>> >
>> > 2017-10-27 18:56 GMT+02:00 Jérémie Roquet :
>> >> 2017-10-27 18:51 GMT+02:00 Luigi Assom :
>> >>> I found and share this resource:
>> >>> http://www.rdfhdt.org/datasets/
>> >>>
>> >>> there is also Wikidata dump in HDT
>> >>
>> >> The link to the Wikidata dump seems dead, unfortunately :'(
>> >
>> > … but there's a file on the server:
>> > http://gaia.infor.uva.es/hdt/wikidata-20170313-all-BETA.hdt.gz (ie.
>> > the link was missing the “.gz”)
>> >
>> > --
>> > Jérémie
>> >
>> > ___
>> > Wikidata mailing list
>> > Wikidata@lists.wikimedia.org
>> > https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>>
>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>>
>
>
> --
>
> "*Love all, trust a few, do wrong to none*" (W. Shakespeare)
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-28 Thread Laura Morales
> Thanks to report that. I remember one issue that I added here 
> https://github.com/pyvandenbussche/lov/issues/66


Yup, still broken! I've tried just now.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-28 Thread Ghislain ATEMEZING
Hi,
+1 to not share the jrnl file !
I agree with Stas that it doesn’t seem a best practice to publish a specific 
journal file for a given RDF store (here for blazegraph). 
Regarding the size of that jrnl file, I remember having one project with almost 
500M for 1 billion triples (~ 1/2 size of disk of the dataset). 

Best,
Ghislain 


Provenance : Courrier pour Windows 10

De : Stas Malyshev
Envoyé le :samedi 28 octobre 2017 08:42
À : Discussion list for the Wikidata project.; Jasper Koehorst
Objet :Re: [Wikidata] Wikidata HDT dump

Hi!

> I will look into the size of the jnl file but should that not be
> located where the blazegraph is running from the sparql endpoint or
> is this a special flavour? Was also thinking of looking into a gitlab
> runner which occasionally could generate a HDT file from the ttl dump
> if our server can handle it but for this an md5 sum file would be
> preferable or should a timestamp be sufficient?

Publishing jnl file for Blazegraph may be not as useful as one would
think, because jnl file is specific for a specific vocabulary and
certain other settings - i.e., unless you run the same WDQS code (which
customizes some of these) of the same version, you won't be able to use
the same file. Of course, since WDQS code is open source, it may be good
enough, so in general publishing such file may be possible.

Currently, it's about 300G size uncompressed. No idea how much
compressed. Loading it takes a couple of days on reasonably powerful
machine, more on labs ones (I haven't tried to load full dump on labs
for a while, since labs VMs are too weak for that).

In general, I'd say it'd take about 100M per million of triples. Less if
triples are using repeated URIs, probably more if they contain ton of
text data.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-28 Thread Ghislain ATEMEZING
Hi Laura,
Thanks to report that. I remember one issue that I added here 
https://github.com/pyvandenbussche/lov/issues/66 

Please shout out or flag an issue on Github! That will help on quality issue of 
different datasets published out there 

Best,
Ghislain 

Provenance : Courrier pour Windows 10

De : Laura Morales
Envoyé le :samedi 28 octobre 2017 10:12
À : wikidata@lists.wikimedia.org
Cc : Discussion list for the Wikidata project.
Objet :Re: [Wikidata] Wikidata HDT dump

> Also, for avoiding your users to re-create the models, you can pre-load 
> "models" from LOV catalog.

The LOV RDF dump is broken instead. Or at least it still was the last time I 
checked. And I don't broken in the sense of Wikidata, that is with some wrong 
types, I mean broken as it doesn't validate at all (some triples are broken).

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-28 Thread Laura Morales
> Also, for avoiding your users to re-create the models, you can pre-load 
> "models" from LOV catalog.

The LOV RDF dump is broken instead. Or at least it still was the last time I 
checked. And I don't broken in the sense of Wikidata, that is with some wrong 
types, I mean broken as it doesn't validate at all (some triples are broken).

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-28 Thread Laura Morales
> @Wouter: As Stas said, you might report that error. I don't agree with Laura 
> who tried to under estimate that "syntax error". It's also about quality ;)


Don't get me wrong, I am all in favor of data quality! :) So if this can be 
fixed, it's better! The thing is, that I've seen so many datasets with these 
kind of type errors, that by now I pretty much live with them and I'm OK with 
these warnings (the triple is not broken after all, it's just not following the 
standards).


> @Laura: Do you have a different rdf2hdt program or the one in the GitHub of 
> HDT project ?


I just use https://github.com/rdfhdt/hdt-cpp compiled from the master branch. 
To verify data instead, I use RIOT (a CL tool from the Apache Jena package) 
like this `riot --validate file.nt`.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-28 Thread Ghislain ATEMEZING
Hello emarx,
Many thanks for sharing KBox. Very interesting project!
One question, how do you deal with different versions of the KB, like the
case here of wikidata dump? Do you fetch their repo every xx time?
Also, for avoiding your users to re-create the models, you can pre-load
"models" from LOV catalog.

Cheers,
Ghislain

2017-10-27 21:56 GMT+02:00 Edgard Marx :

> Hey guys,
>
> I don't know if you already knew about it,
> but you can use KBox for Wikidata, DBpedia, Freebase, Lodstats...
>
> https://github.com/AKSW/KBox
>
> And yes, you can also use it to merge your graph with one of those
>
> https://github.com/AKSW/KBox#how-can-i-query-multi-bases
>
>  cheers,
> 
>
>
>
> On Oct 27, 2017 21:02, "Jasper Koehorst"  wrote:
>
> I will look into the size of the jnl file but should that not be located
> where the blazegraph is running from the sparql endpoint or is this a
> special flavour?
> Was also thinking of looking into a gitlab runner which occasionally could
> generate a HDT file from the ttl dump if our server can handle it but for
> this an md5 sum file would be preferable or should a timestamp be
> sufficient?
>
> Jasper
>
>
> > On 27 Oct 2017, at 18:58, Jérémie Roquet  wrote:
> >
> > 2017-10-27 18:56 GMT+02:00 Jérémie Roquet :
> >> 2017-10-27 18:51 GMT+02:00 Luigi Assom :
> >>> I found and share this resource:
> >>> http://www.rdfhdt.org/datasets/
> >>>
> >>> there is also Wikidata dump in HDT
> >>
> >> The link to the Wikidata dump seems dead, unfortunately :'(
> >
> > … but there's a file on the server:
> > http://gaia.infor.uva.es/hdt/wikidata-20170313-all-BETA.hdt.gz (ie.
> > the link was missing the “.gz”)
> >
> > --
> > Jérémie
> >
> > ___
> > Wikidata mailing list
> > Wikidata@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>


-- 

"*Love all, trust a few, do wrong to none*" (W. Shakespeare)
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-28 Thread Ghislain ATEMEZING
Hi,
@Wouter: As Stas said, you might report that error. I don't agree with Laura 
who tried to under estimate that "syntax error". It's also about quality ;)

Many thanks in advance !
@Laura: Do you have a different rdf2hdt program or the one in the GitHub of HDT 
project ? 

Best,
Ghislain 

Sent from my iPhone, may include typos

> Le 28 oct. 2017 à 08:50, Stas Malyshev  a écrit :
> 
> Hi!
> 
>> I wouldn't call these "syntax" errors, just "logical/type" errors.
>> It would be great if these could fixed by changing the correct type from 
>> decimal to float/double. On the other hand, I've never seen any medium or 
>> large dataset without this kind of errors. So I would personally treat these 
>> as warnings at worst.
> 
> Float/double are range-limited and have limited precision. Decimals are
> not. Whether it is important for geo precision, needs to be checked, but
> we could be hitting the limits of precision pretty quickly.
> 
> -- 
> Stas Malyshev
> smalys...@wikimedia.org
> 
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-28 Thread Stas Malyshev
Hi!

> The first part of the Turtle data stream seems to contain syntax errors
> for some of the XSD decimal literals.  The first one appears on line 13,291:
> 
> ```text/turtle
> 
> 
> "1.0E-6"^^ .
> ```

Could you submit a phabricator task (phabricator.wikimedia.org) about
this? If it's against the standard it certainly should not be encoded
like that.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-28 Thread Stas Malyshev
Hi!

> I will look into the size of the jnl file but should that not be
> located where the blazegraph is running from the sparql endpoint or
> is this a special flavour? Was also thinking of looking into a gitlab
> runner which occasionally could generate a HDT file from the ttl dump
> if our server can handle it but for this an md5 sum file would be
> preferable or should a timestamp be sufficient?

Publishing jnl file for Blazegraph may be not as useful as one would
think, because jnl file is specific for a specific vocabulary and
certain other settings - i.e., unless you run the same WDQS code (which
customizes some of these) of the same version, you won't be able to use
the same file. Of course, since WDQS code is open source, it may be good
enough, so in general publishing such file may be possible.

Currently, it's about 300G size uncompressed. No idea how much
compressed. Loading it takes a couple of days on reasonably powerful
machine, more on labs ones (I haven't tried to load full dump on labs
for a while, since labs VMs are too weak for that).

In general, I'd say it'd take about 100M per million of triples. Less if
triples are using repeated URIs, probably more if they contain ton of
text data.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-28 Thread Laura Morales
> The first part of the Turtle data stream seems to contain syntax errors for 
> some of the XSD decimal literals.  The first one appears on line 13,291:
> 
> Notice that scientific notation is not allowed in the lexical form of 
> decimals according to XML > Schema Part 2: 
> Datatypes[https://www.w3.org/TR/xmlschema11-2/#decimal].  (It is allowed in 
> floats and doubles.)  Is this a known issue or should I report this somewhere?

I wouldn't call these "syntax" errors, just "logical/type" errors.
It would be great if these could fixed by changing the correct type from 
decimal to float/double. On the other hand, I've never seen any medium or large 
dataset without this kind of errors. So I would personally treat these as 
warnings at worst.

@Wouter when you build the HDT file, could you please also generate the 
.hdt.index file? With rdf2hdt, this should be activated with the -i flag. Thank 
you again!

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-27 Thread Laura Morales
> is it possible to store a weighted adjacency matrix as an HDT instead of an 
> RDF?
> 
> Something like a list of entities for each entity, or even better a list of 
> tuples for each entity.
> So that a tuple could be generalised with propoerties.

Sorry I don't know this, you would have to ask the devs. As far as I 
understand, it's a triplestore and that should be it...

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-27 Thread Jérémie Roquet
2017-10-27 18:56 GMT+02:00 Jérémie Roquet :
> 2017-10-27 18:51 GMT+02:00 Luigi Assom :
>> I found and share this resource:
>> http://www.rdfhdt.org/datasets/
>>
>> there is also Wikidata dump in HDT
>
> The link to the Wikidata dump seems dead, unfortunately :'(

Javier D. Fernández of the HDT team was very quick to fix the link :-)

One can contact them for support either on their forum or by email¹,
as they are willing to help the Wikidata community make use of HDT.

Best regards,

¹ http://www.rdfhdt.org/team/

-- 
Jérémie

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-27 Thread Jérémie Roquet
2017-10-27 18:56 GMT+02:00 Jérémie Roquet :
> 2017-10-27 18:51 GMT+02:00 Luigi Assom :
>> I found and share this resource:
>> http://www.rdfhdt.org/datasets/
>>
>> there is also Wikidata dump in HDT
>
> The link to the Wikidata dump seems dead, unfortunately :'(

… but there's a file on the server:
http://gaia.infor.uva.es/hdt/wikidata-20170313-all-BETA.hdt.gz (ie.
the link was missing the “.gz”)

-- 
Jérémie

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-27 Thread Jérémie Roquet
2017-10-27 18:51 GMT+02:00 Luigi Assom :
> I found and share this resource:
> http://www.rdfhdt.org/datasets/
>
> there is also Wikidata dump in HDT

The link to the Wikidata dump seems dead, unfortunately :'(

-- 
Jérémie

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-27 Thread Luigi Assom
Laura, Wouter, thank you
I did not know about HDT

I found and share this resource:
http://www.rdfhdt.org/datasets/

there is also Wikidata dump in HDT

I am new to it:
is it possible to store a weighted adjacency matrix as an HDT instead of an
RDF?

Something like a list of entities for each entity, or even better a list of
tuples for each entity.
So that a tuple could be generalised with propoerties.

Here an example with one property, 'weight', and an entity 'x1' is
associated to a list of other entities, including itself.
x1 = [(w1, x1) ... (w1, xn)]




On Fri, Oct 27, 2017 at 6:13 PM, Laura Morales  wrote:

> > You can mount te jnl file directly to blazegraph so loading and indexing
> is not needed anymore.
>
> How much larger would this be compared to the Turtle file?
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-27 Thread Laura Morales
> You can mount te jnl file directly to blazegraph so loading and indexing is 
> not needed anymore.

How much larger would this be compared to the Turtle file?

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-27 Thread Jasper Koehorst
You can mount te jnl file directly to blazegraph so loading and indexing is not 
needed anymore. 

Van: Laura Morales
Verzonden: vrijdag 27 oktober 2017 17:18
Aan: wikidata@lists.wikimedia.org
CC: Discussion list for the Wikidata project.
Onderwerp: Re: [Wikidata] Wikidata HDT dump

> Would it be an idea if HDT remains unfeasible to place the journal file of 
> blazegraph online?
> Yes, people need to use blazegraph if they want to access the files and query 
> it but it could be an extra next to turtle dump?

How would a blazegraph journal file be better than a Turtle dump? Maybe it's 
smaller in size? Simpler to use?

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-27 Thread Laura Morales
> Dear Laura, others,
> 
> If somebody points me to the RDF datadump of Wikidata I can deliver an
> HDT version for it, no problem. (Given the current cost of memory I
> do not believe that the memory consumption for HDT creation is a
> blocker.)

This would be awesome! Thanks Wouter. To the best of my knowledge, the most up 
to date dump is this one [1]. Let me know if you need any help with anything. 
Thank you again!

[1] https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz

---
Cheers,
Wouter Beek.

Email: wou...@triply.cc
WWW: http://triply.cc
Tel: +31647674624


On Fri, Oct 27, 2017 at 5:08 PM, Laura Morales  wrote:
> Hello everyone,
>
> I'd like to ask if Wikidata could please offer a HDT [1] dump along with the 
> already available Turtle dump [2]. HDT is a binary format to store RDF data, 
> which is pretty useful because it can be queried from command line, it can be 
> used as a Jena/Fuseki source, and it also uses orders-of-magnitude less space 
> to store the same data. The problem is that it's very impractical to generate 
> a HDT, because the current implementation requires a lot of RAM processing to 
> convert a file. For Wikidata it will probably require a machine with 
> 100-200GB of RAM. This is unfeasible for me because I don't have such a 
> machine, but if you guys have one to share, I can help setup the rdf2hdt 
> software required to convert Wikidata Turtle to HDT.
>
> Thank you.
>
> [1] http://www.rdfhdt.org/[http://www.rdfhdt.org/]
> [2] 
> https://dumps.wikimedia.org/wikidatawiki/entities/[https://dumps.wikimedia.org/wikidatawiki/entities/]
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata[https://lists.wikimedia.org/mailman/listinfo/wikidata]

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata[https://lists.wikimedia.org/mailman/listinfo/wikidata]

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-27 Thread Ghislain ATEMEZING
@Wouter: See here https://dumps.wikimedia.org/wikidatawiki/entities/ ?
Nice idea LAura.

Ghislain

El vie., 27 oct. 2017 a las 17:21, Wouter Beek ()
escribió:

> Dear Laura, others,
>
> If somebody points me to the RDF datadump of Wikidata I can deliver an
> HDT version for it, no problem.  (Given the current cost of memory I
> do not believe that the memory consumption for HDT creation is a
> blocker.)
>
> ---
> Cheers,
> Wouter Beek.
>
> Email: wou...@triply.cc
> WWW: http://triply.cc
> Tel: +31647674624 <+31%206%2047674624>
>
>
> On Fri, Oct 27, 2017 at 5:08 PM, Laura Morales  wrote:
> > Hello everyone,
> >
> > I'd like to ask if Wikidata could please offer a HDT [1] dump along with
> the already available Turtle dump [2]. HDT is a binary format to store RDF
> data, which is pretty useful because it can be queried from command line,
> it can be used as a Jena/Fuseki source, and it also uses
> orders-of-magnitude less space to store the same data. The problem is that
> it's very impractical to generate a HDT, because the current implementation
> requires a lot of RAM processing to convert a file. For Wikidata it will
> probably require a machine with 100-200GB of RAM. This is unfeasible for me
> because I don't have such a machine, but if you guys have one to share, I
> can help setup the rdf2hdt software required to convert Wikidata Turtle to
> HDT.
> >
> > Thank you.
> >
> > [1] http://www.rdfhdt.org/
> > [2] https://dumps.wikimedia.org/wikidatawiki/entities/
> >
> > ___
> > Wikidata mailing list
> > Wikidata@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikidata
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
-- 
---
"Love all, trust a few, do wrong to none" (W. Shakespeare)
Web: http://atemezing.org
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-27 Thread Wouter Beek
Dear Laura, others,

If somebody points me to the RDF datadump of Wikidata I can deliver an
HDT version for it, no problem.  (Given the current cost of memory I
do not believe that the memory consumption for HDT creation is a
blocker.)

---
Cheers,
Wouter Beek.

Email: wou...@triply.cc
WWW: http://triply.cc
Tel: +31647674624


On Fri, Oct 27, 2017 at 5:08 PM, Laura Morales  wrote:
> Hello everyone,
>
> I'd like to ask if Wikidata could please offer a HDT [1] dump along with the 
> already available Turtle dump [2]. HDT is a binary format to store RDF data, 
> which is pretty useful because it can be queried from command line, it can be 
> used as a Jena/Fuseki source, and it also uses orders-of-magnitude less space 
> to store the same data. The problem is that it's very impractical to generate 
> a HDT, because the current implementation requires a lot of RAM processing to 
> convert a file. For Wikidata it will probably require a machine with 
> 100-200GB of RAM. This is unfeasible for me because I don't have such a 
> machine, but if you guys have one to share, I can help setup the rdf2hdt 
> software required to convert Wikidata Turtle to HDT.
>
> Thank you.
>
> [1] http://www.rdfhdt.org/
> [2] https://dumps.wikimedia.org/wikidatawiki/entities/
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-27 Thread Laura Morales
> Would it be an idea if HDT remains unfeasible to place the journal file of 
> blazegraph online?
> Yes, people need to use blazegraph if they want to access the files and query 
> it but it could be an extra next to turtle dump?

How would a blazegraph journal file be better than a Turtle dump? Maybe it's 
smaller in size? Simpler to use?

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata HDT dump

2017-10-27 Thread Jasper Koehorst
Would it be an idea if HDT remains unfeasible to place the journal file of 
blazegraph online?
Yes, people need to use blazegraph if they want to access the files and query 
it but it could be an extra next to turtle dump?



> On 27 Oct 2017, at 17:08, Laura Morales  wrote:
> 
> Hello everyone,
> 
> I'd like to ask if Wikidata could please offer a HDT [1] dump along with the 
> already available Turtle dump [2]. HDT is a binary format to store RDF data, 
> which is pretty useful because it can be queried from command line, it can be 
> used as a Jena/Fuseki source, and it also uses orders-of-magnitude less space 
> to store the same data. The problem is that it's very impractical to generate 
> a HDT, because the current implementation requires a lot of RAM processing to 
> convert a file. For Wikidata it will probably require a machine with 
> 100-200GB of RAM. This is unfeasible for me because I don't have such a 
> machine, but if you guys have one to share, I can help setup the rdf2hdt 
> software required to convert Wikidata Turtle to HDT.
> 
> Thank you.
> 
> [1] http://www.rdfhdt.org/
> [2] https://dumps.wikimedia.org/wikidatawiki/entities/
> 
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Wikidata HDT dump

2017-10-27 Thread Laura Morales
Hello everyone,

I'd like to ask if Wikidata could please offer a HDT [1] dump along with the 
already available Turtle dump [2]. HDT is a binary format to store RDF data, 
which is pretty useful because it can be queried from command line, it can be 
used as a Jena/Fuseki source, and it also uses orders-of-magnitude less space 
to store the same data. The problem is that it's very impractical to generate a 
HDT, because the current implementation requires a lot of RAM processing to 
convert a file. For Wikidata it will probably require a machine with 100-200GB 
of RAM. This is unfeasible for me because I don't have such a machine, but if 
you guys have one to share, I can help setup the rdf2hdt software required to 
convert Wikidata Turtle to HDT.

Thank you.

[1] http://www.rdfhdt.org/
[2] https://dumps.wikimedia.org/wikidatawiki/entities/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata