[Xmldatadumps-l] Re: Is there a problem with the wikidata-dump?

2024-01-14 Thread Gerhard Gonter
On Thu, Jan 11, 2024 at 8:26 AM Wurgl wrote:
> 22 141G   22 31.6G    0     0   748k      0 55:10:00 12:18:26 42:51:34  698k
> curl: (18) transfer closed with 118232009816 bytes remaining to read

There you have it: curl got only 22% (31.6 GB of 141 GB); 118 GB are missing.

> Something does not like me very much :
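For what it's worth: when a transfer dies with bytes remaining, curl can usually resume from where it stopped instead of starting over. A minimal sketch, assuming the server supports range requests and a curl new enough for --retry-all-errors (7.71+; drop that flag on older versions):

$ curl -O -C - --retry 10 --retry-delay 60 --retry-all-errors \
    https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2

-C - tells curl to look at the size of the partially downloaded file and continue from that offset; the retry flags make it re-attempt on its own after transient failures.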

[Xmldatadumps-l] Re: Is there a problem with the wikidata-dump?

2024-01-13 Thread Platonides
I would probably open a task to have wget available in the kubernetes cluster, and another, low-priority one, for investigating why the connection gets dropped between toolforge and dumps.w.o.

On Sat, 13 Jan 2024 at 08:42, Wurgl wrote:
> Hello!
>
> wget was the tool I was using with the jsub environment,

[Xmldatadumps-l] Re: Is there a problem with the wikidata-dump?

2024-01-13 Thread Wurgl
Hello!

wget was the tool I was using with the jsub environment, but wget is not available any more in kubernetes (with toolforge jobs start …) :-(

$ webservice php7.4 shell
tools.persondata@shell-1705135256:~$ wget
bash: wget: command not found

Wolfgang

On Sat, 13 Jan 2024 at 02:20, … wrote:
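In the kubernetes images, curl is still available (the jobs in this thread already rely on it) and covers the same ground as wget. A rough mapping, offered only as a stopgap until wget is installed:

$ curl -L -O https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2
# roughly equivalent to: wget <url>; add -C - to resume, like wget -c

(-L follows redirects, -O keeps the remote file name, -C - resumes from the current local size.)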

[Xmldatadumps-l] Re: Is there a problem with the wikidata-dump?

2024-01-12 Thread Platonides
Gerhard said that for him the downloading job ran for about 12 hours. It seems the connection was closed. I wouldn't be surprised if this was facing a similar problem as https://phabricator.wikimedia.org/T351876. With such a long download time, it isn't that strange that there could be connection errors
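A quick way to tell a truncated download from a corrupted one is to compare the local file size against the Content-Length the server reports. A minimal sketch, assuming GNU stat and that the server answers HEAD requests with a Content-Length header:

$ url=https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2
$ curl -sI "$url" | tr -d '\r' | awk 'tolower($1) == "content-length:" {print $2}'
$ stat -c %s wikidatawiki-latest-pages-articles-multistream.xml.bz2

If the two numbers differ, the transfer was cut short rather than the file being bad on the server.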

[Xmldatadumps-l] Re: Is there a problem with the wikidata-dump?

2024-01-10 Thread Wurgl
Okay, yesterday evening I did the following: I started this script

##
#!/bin/bash
curl https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 | bzip2 -d | tail -200
##

with this command:

tools.persondata@tools-sgebastion-11:~$ toolforge jobs run --
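One caveat with a pipeline like this: its exit status is that of the last stage (tail), so a curl that dies mid-transfer goes unnoticed by the job framework. A minimal defensive variant, assuming bash (pipefail is a bash option) and adding -sS --fail so curl also reports HTTP-level errors:

##
#!/bin/bash
# pipefail: the pipeline fails if any stage fails, not just the last one,
# so a dropped connection in curl is no longer silent.
set -o pipefail
curl -sS --fail \
    https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 \
  | bzip2 -d | tail -200
echo "pipeline exit status: $?"
##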

[Xmldatadumps-l] Re: Is there a problem with the wikidata-dump?

2024-01-10 Thread Gerhard Gonter
Thanks for the link to your Wikipedia page, but can I also find the php program itself somewhere? I now know that it focuses on two properties, namely P935 (Commons gallery) and P373 (Commons category), but what it does with them is not described.

regards, Gerhard

[Xmldatadumps-l] Re: Is there a problem with the wikidata-dump?

2024-01-10 Thread Wurgl
Hello Gerhard!

It is just used to build a database for checking de-wikipedia commons/commonscat links … A long time ago someone asked for it. https://de.wikipedia.org/wiki/Benutzer:Wurgl/Probleme_Commons

Wolfgang

On Wed, 10 Jan 2024 at 18:50, Gerhard Gonter wrote:
> On Wed, Jan 10, 202

[Xmldatadumps-l] Re: Is there a problem with the wikidata-dump?

2024-01-10 Thread Gerhard Gonter
On Wed, Jan 10, 2024 at 6:19 PM Wurgl wrote:
> The relevant line is this one:
> curl -s https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 | bzip2 -d | php ~/dumps/wikidata_sitelinks.php

Btw, just out of curiosity, is wikidata_sitelinks

[Xmldatadumps-l] Re: Is there a problem with the wikidata-dump?

2024-01-10 Thread Gerhard Gonter
On Wed, Jan 10, 2024 at 6:19 PM Wurgl wrote:
> The relevant line is this one:
> curl -s https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 | bzip2 -d | php ~/dumps/wikidata_sitelinks.php
>
> Yes, I double-checked it on my machine at home
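For debugging a pipeline like this, it is not necessary to wait hours for the full file: a range request fetches only the first stretch, and if valid XML comes out of the decompressor, curl, bzip2 and the pipeline are all working, so a full-file failure points at the transfer rather than the tools. A sketch, assuming the server honors range requests (the 100 MB cut-off is arbitrary):

$ curl -s -r 0-104857599 \
    https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 \
  | bzip2 -d | head -50
# the first lines of the dump's XML header should appear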

[Xmldatadumps-l] Re: Is there a problem with the wikidata-dump?

2024-01-10 Thread Wurgl
Hello Ariel!

It is not "my bzip2", it is the bzip2 on tools-sgebastion-11 in the toolserver cloud … well, actually on one of the servers that are used when I start a script within the kubernetes environment there (with php 7.4).

When you have an account there, you can look at: /data/project/persondata/d

[Xmldatadumps-l] Re: Is there a problem with the wikidata-dump?

2024-01-10 Thread Ariel Glenn WMF
I would hazard a guess that your bz2 unzip app does not handle multistream files in an appropriate way, Wurgl. The multistream files consist of several bzip2-compressed files concatenated together; see https://meta.wikimedia.org/wiki/Data_dumps/Dump_format#Multistream_dumps for details. Try downloading
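The concatenated-stream behavior is easy to check locally. A minimal sketch: build a tiny two-stream file and decompress it. Standard command-line bzip2 handles the concatenation and prints both lines; a decompressor that stops after the first stream (as some library bindings historically did) would print only the first:

$ printf 'first\n'  | bzip2 >  multi.bz2
$ printf 'second\n' | bzip2 >> multi.bz2
$ bzip2 -dc multi.bz2
first
second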

[Xmldatadumps-l] Re: Is there a problem with the wikidata-dump?

2024-01-10 Thread Xabriel Collazo Mojica
Gerhard: Thanks for the extra checks!

Wolfgang: I can confirm Gerhard's findings. The file appears correct, and ends with the right footer.

On Wed, Jan 10, 2024 at 10:50 AM Gerhard Gonter wrote:
> On Fri, Jan 5, 2024 at 5:03 PM Wurgl wrote:
> >
> > Hello!
> >
> > I am having some unexpected messages

[Xmldatadumps-l] Re: Is there a problem with the wikidata-dump?

2024-01-10 Thread Gerhard Gonter
On Fri, Jan 5, 2024 at 5:03 PM Wurgl wrote:
>
> Hello!
>
> I am having some unexpected messages, so I tried the following:
>
> curl -s https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 | bzip2 -d | tail
>
> and got this:
>
> bzip2: Compress

[Xmldatadumps-l] Re: Is there a problem with the wikidata-dump?

2024-01-09 Thread Xabriel Collazo Mojica
Hello Wolfgang,

I am trying to repro your issue. The file is ~140 GB, so doing a `bzcat` takes a long while. Will get back to you with the result.

For now, here is the sha1 hash of that file, so that you can compare it against your local copy and see if it was corrupted in flight:

$ sha1sum wikidatawiki-
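As a side note, the dump directories also publish checksum files next to the data files, so the hash does not have to travel by mail. A minimal sketch, assuming the usual *-sha1sums.txt naming on dumps.wikimedia.org (check the directory listing for the exact file name):

$ curl -sS https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-sha1sums.txt \
    | grep pages-articles-multistream.xml.bz2
$ sha1sum wikidatawiki-latest-pages-articles-multistream.xml.bz2

If the two sha1 values match, the local copy is intact and the problem lies elsewhere, e.g. in the decompression step.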