On Thu, Jan 11, 2024 at 8:26 AM Wurgl wrote:
> 22  141G   22 31.6G    0     0   748k      0 55:10:00 12:18:26 42:51:34  698k
> curl: (18) transfer closed with 118232009816 bytes remaining to read
There you have it: curl only got about 22% (31.6 GiB); 118 GB are still missing.
> Something does not like me very much :
I would probably open a task to have wget available in the Kubernetes
cluster, and another, low-priority one, for investigating why the connection
gets dropped between Toolforge and dumps.w.o
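As a stopgap, the download can also be made restartable with curl itself by
writing to a file and resuming with -C - instead of piping the stream
directly into bzip2. A minimal sketch, not tested on Toolforge, with the
output filename simply taken from the URL:
##
#!/bin/bash
# Resume the download from the current file size (-C -) and give up after
# 20 attempts; -f makes curl fail on HTTP errors instead of saving them.
URL=https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2
OUT=wikidatawiki-latest-pages-articles-multistream.xml.bz2
for i in $(seq 1 20); do
    curl -f -C - -o "$OUT" "$URL" && break
    echo "attempt $i interrupted, retrying in 60s" >&2
    sleep 60
done
##
The downside is that the ~140 GB file then has to fit in the tool's storage
before it can be piped into bzip2.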
On Sat, 13 Jan 2024 at 08:42, Wurgl wrote:
> Hello!
>
> wget was the tool I was using in the jsub environment,
Hello!
wget was the tool I was using in the jsub environment, but wget is not
available anymore in Kubernetes (with toolforge jobs start …) :-(
$ webservice php7.4 shell
tools.persondata@shell-1705135256:~$ wget
bash: wget: command not found
Wolfgang
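For the common cases, curl can stand in for wget until it is back in the
images; roughly:
$ curl -fLO <URL>        # like "wget <URL>": save under the remote file name
$ curl -fLO -C - <URL>   # like "wget -c <URL>": resume a partial download
(-O keeps the remote name, -L follows redirects, -f fails on HTTP errors,
and -C - continues from however much is already on disk.)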
On Sat, 13 Jan 2024 at 02:20, … wrote:
Gerhard said that for him the downloading job ran for about 12 hours. It
seems the connection was closed.
I wouldn't be surprised if this is running into a problem similar to
https://phabricator.wikimedia.org/T351876
With such a long download time, it isn't that strange that there could be
connection errors.
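A rough back-of-the-envelope from the progress line quoted earlier: ~141 GiB
at the reported ~750 kB/s is on the order of 150e9 / 750e3 ≈ 200,000 seconds,
i.e. roughly 55 hours, which matches curl's own 55:10:00 estimate. Keeping a
single HTTP connection alive for more than two days is asking a lot.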
Okay,
yesterday evening I did the following:
I started this script
##
#!/bin/bash
curl \
  https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 \
  | bzip2 -d | tail -200
##
With this command:
tools.persondata@tools-sgebastion-11:~$ toolforge jobs run --
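(The actual command is cut off above. For reference only, a jobs-framework
invocation of such a wrapper script generally looks something like the line
below; the job name, script path and image name are made-up placeholders,
not the command that was really used:)
$ toolforge jobs run dumptail --image php7.4 --command ./dumptail.sh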
Thanks for the link to your Wikipedia page, but can I also find the
php program itself somewhere? I now know that it focuses on two
properties, namely P935 (Commons gallery) and P373 (Commons category),
but what it does with them is not described.
regards, Gerhard
Hello Gerhard!
It is just used to build a database for checking commons/commonscat links
on de.wikipedia … A long time ago someone asked for it.
https://de.wikipedia.org/wiki/Benutzer:Wurgl/Probleme_Commons
Wolfgang
On Wed, 10 Jan 2024 at 18:50, Gerhard Gonter wrote:
> On Wed, Jan 10, 2024 at 6:19 PM Wurgl wrote:
On Wed, Jan 10, 2024 at 6:19 PM Wurgl wrote:
> The relevant line is this one:
> curl -s
> https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2
> | bzip2 -d | php ~/dumps/wikidata_sitelinks.php
Btw, just out of curiosity, is wikidata_sitelinks
On Wed, Jan 10, 2024 at 6:19 PM Wurgl wrote:
> The relevant line is this one:
> curl -s
> https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2
> | bzip2 -d | php ~/dumps/wikidata_sitelinks.php
>
> Yes, I double-checked it on my machine at home
Hello Ariel!
It is not "my bzip2", it is bzip2 on tools-sgebastion-11 in the
toolserver-cloud … well, actually one of the servers which are used, when I
start a script within the kubernetes environment there (with php 7.4)
When you have an account there, you can look at:
/data/project/persondata/d
I would hazard a guess that your bz2 unzip app does not handle multistream
files in an appropriate way, Wurgl. The multistream files consist of
several bzip2-compressed files concatenated together; see
https://meta.wikimedia.org/wiki/Data_dumps/Dump_format#Multistream_dumps
for details. Try downlo
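For what it's worth, this is also what the companion -index.txt.bz2 file is
for: it lists "offset:page_id:title" lines, and every offset marks the start
of one self-contained bzip2 stream inside the big file, so a single stream
can be cut out and decompressed on its own. A rough sketch; the two offsets
below are placeholders that would come from two adjacent offsets in the index:
##
#!/bin/bash
# Cut one bzip2 stream out of the multistream dump and decompress just that.
DUMP=wikidatawiki-latest-pages-articles-multistream.xml.bz2
START=597      # placeholder: offset of the stream you want (from the index)
END=132739     # placeholder: offset of the next stream in the index
dd if="$DUMP" bs=1 skip="$START" count=$((END - START)) 2>/dev/null \
    | bzip2 -dc | head -20
##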
Gerhard: Thanks for the extra checks!
Wolfgang: I can confirm Gerhard's findings. The file appears correct and
ends with the right footer.
On Wed, Jan 10, 2024 at 10:50 AM Gerhard Gonter wrote:
> On Fri, Jan 5, 2024 at 5:03 PM Wurgl wrote:
> >
> > Hello!
> >
> > I am having some unexpected messages
On Fri, Jan 5, 2024 at 5:03 PM Wurgl wrote:
>
> Hello!
>
> I am having some unexpected messages, so I tried the following:
>
> curl -s
> https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2
> | bzip2 -d | tail
>
> and got this:
>
> bzip2: Compress
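That is also exactly what a connection dropped mid-transfer looks like from
bzip2's side: the compressed stream just stops, and bzip2 complains that it
ends unexpectedly. A quick sanity check is to ask the server how big the file
is supposed to be and compare that with how many bytes actually arrived:
$ curl -sI https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 | grep -i content-length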
Hello Wolfgang,
I am trying to repro your issue. The file is ~140 GB, so doing a `bzcat`
takes a long while. Will get back to you with the result.
For now, here is the sha1 hash of that file so that you can compare it
against your local copy and see if it was corrupted in flight:
$ sha1sum wikidatawiki-
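(The dump run also publishes checksum files that can be compared against the
locally computed hash; the exact filename below is an assumption, check the
directory listing on dumps.wikimedia.org for the real one:)
$ curl -s https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-sha1sums.txt | grep multistream.xml.bz2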