[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-25 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  I also see 
https://grafana.wikimedia.org/d/00264/wikidata-dump-downloads?orgId=1=5m=now-2y=now
, which I noticed from some tickets involving @Addshore (cf. T280678: Crunch 
and delete many old dumps logs, and friends) and from a pointer from a 
colleague.
  
  As I noted, there are some complications around the 200s. T280678's pointer 
to the source processing at 
https://github.com/wikimedia/analytics-wmde-scripts/blob/master/src/wikidata/dumpDownloads.php#L12
 shows consideration for both 206s and 200s. Future TODO in case we want to 
figure out how to deal with the different-sized 200s and the apparent 
downloader utilities.
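  For illustration, here's a rough way to split the counting by status code, in 
the spirit of dumpDownloads.php (a sketch only - the real script also has to 
decide how to group 206 range requests from multi-connection downloaders into 
whole downloads):
  
stat1007:~$ zgrep wikidatawiki /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20231024.gz | grep latest-all | grep -c " 200 "   # whole-file responses
stat1007:~$ zgrep wikidatawiki /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20231024.gz | grep latest-all | grep -c " 206 "   # partial (range) responses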

TASK DETAIL
  https://phabricator.wikimedia.org/T347605


2023-10-25 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  ^ Update.


2023-10-25 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Looking at yesterday's downloads with a rudimentary grep, we're not far from 
1K downloads, and that's just for the //latest-all// files. That also doesn't 
count mirrors.
  
stat1007:~$ zgrep wikidatawiki /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20231024.gz | grep latest-all | grep " 200 " | wc -l
  
  Now, it's good to keep in mind that some of these downloads are mirror jobs 
themselves, but looking at some of the source IPs, it's clear that a good 
number of them are not mirrors. A hedged refinement is sketched below.
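  (Counting distinct clients instead of raw 200s - this assumes the client 
identifier is the first whitespace-separated field of the access log, which may 
not hold for this log format:)
  
stat1007:~$ zgrep wikidatawiki /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20231024.gz | grep latest-all | grep " 200 " | awk '{print $1}' | sort -u | wc -l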


2023-10-25 Thread Addshore
Addshore added a comment.


  @bking Would it be possible to get me access to an R2 bucket that is paid for 
by the WMF in some way?
  I'll happily continue my manual process of putting a JNL file in a bucket 
every few months for folks to use until this is more automated.


2023-10-24 Thread bking
bking triaged this task as "Low" priority.
bking closed this task as "Resolved".
bking added a comment.


  In T347605#9249978, @dr0ptp4kt wrote:
  
  > @bking just wanted to express my gratitude for the support on this ticket 
and its friends T344905: Publish WDQS JNL files to dumps.wikimedia.org and 
T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF 
org.openrdf.rio.RDFParseException: Expected '.', found 'g'`. FWIW I do think it 
would be good to automate this. As a matter of getting to a functional WDQS 
local environment replete with BlazeGraph data, it would accelerate things a 
lot. I think my only reservations are that:
  >
  > 1. It takes time to automate. Any rough guess on level of effort for that? 
I understand that'd inform relative prioritization against the large pile of 
other things.
  
  I appreciate your appreciation! I only wish I'd gotten something useful up in 
R2. For a truly reliable process, we'd need to implement @Addshore's 
suggestions around "...checksum[ming] of the thing prior to all of the copying, 
and also a timestamp the thing was taken from."
  
  Unfortunately, because the initial process took a lot longer than expected 
and we have a lot of other things on our plate, this has been deprioritized for 
the time being. I'll try to work on it in my spare time, and we can revisit the 
discussion next quarter for sure. Sorry for the trouble!
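  (For reference, a minimal sketch of that suggestion, with a placeholder 
journal path - just a sidecar file carrying the snapshot timestamp and the 
pre-copy checksum, published alongside the artifact:)
  
date -u +%FT%TZ > wikidata.jnl.meta                    # timestamp the journal was taken from
sha1sum /srv/wdqs/wikidata.jnl >> wikidata.jnl.meta    # checksum prior to all of the copying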


2023-10-16 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Good question - I meant the contrast between the .ttl.gz dumps, with 
everything that goes into munging and importing them (in aggregate across all 
downloaders of those files), and the .jnl route, where downloaders don't have 
to munge and import at all. Napkin-mathsing it, the thought was that the energy 
savings accrue roughly as soon as the 16 cores x 12 hours of compression time 
for the .jnl has been "saved" by people in aggregate not needing to run the 
import process. (I'm waving away the client-side decompression, which 
technically happens twice for the .ttl.gz user but only once for the .jnl.zst 
user, and any other disk or network transfer pieces, as those are all close 
enough, I suppose.)
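  (Spelled out as a purely illustrative break-even: the one-time compression 
cost is C = 16 cores x 12 h = 192 core-hours. If each downloader would have 
spent I core-hours munging and importing, the .jnl route nets out once N 
downloaders satisfy N x I > C, i.e. N > 192 / I; I is the unknown here.)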
  
  I'll go check on what stats may be readily available on dumps downloads.
  
  Good point on having a checksum and timestamp. Yeah, it would be nice to have 
it in an on-demand place without the need for extra data transfer!


2023-10-16 Thread Addshore
Addshore added a comment.


  > The energy savings are possibly unclear, at least in the current case (but 
that's partly because it's hard to know how much energy is being expended, 
which could be guessed at from the number of dump downloads; not sure how easy 
it is to get those stats; this is different from the bandwidth transfer on 
Cloudflare R2).
  
  Are you talking about the number of dump downloads of the JNL or of the 
triples?
  I believe the only way to really tell the number of dump downloads of the 
JNL on R2 is to infer it from the bytes downloaded. For example, on my large 
JNL file I have had many connections / download starts, but only a few people 
downloaded the full file size. As the file is so big, the inferred number is 
probably fairly accurate though.
  
  For my JNL file in the past 30 days:
  
  - 829 connection requests, 141 unique visitors
  - 8TB of data served, so probably 6-7 full downloads in 30 days (8TB divided 
by a ~1.2TB file works out to roughly 6.7 full-file equivalents)
  
  > Thinking ahead a little, we'd probably want to generalize anything so that 
it can take arbitrary .jnls, for example for split graphs.
  
  I think the generalized approach here should probably just be taking large 
files from WDQS land and dumping them into a space like R2. Information that 
would be great to flow with that data would be a checksum of the thing prior 
to all of the copying, and also a timestamp the thing was taken from.
  
  One of the next steps I want to try along this journey is downloading the 
file from R2 and adding it to a volume image for an EC2 machine or compute 
machine on another cloud provider. This should also remove the "download" step 
for those that want to use this file in cloud lands and provide a fairly 
instant experience on whatever hardware for "your own WDQS". A rough sketch of 
that flow is below.


2023-10-13 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  @bking just wanted to express my gratitude for the support on this ticket and 
its friends T344905: Publish WDQS JNL files to dumps.wikimedia.org and T347647: 
2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF 
org.openrdf.rio.RDFParseException: Expected '.', found 'g'`. FWIW I do think it 
would be good to automate this. As a matter of getting to a functional WDQS 
local environment replete with BlazeGraph data, it would accelerate things a 
lot. I think my only reservations are that:
  
  1. It takes time to automate. Any rough guess on level of effort for that? I 
understand that'd inform relative prioritization against the large pile of 
other things.
  2. The energy savings are possibly unclear, at least in the current case (but 
that's partly because it's hard to know how much energy is being expended, 
which could be guessed at from the number of dump downloads; not sure how easy 
it is to get those stats; this is different from the bandwidth transfer on 
Cloudflare R2).
  
  However, I would probably err on the side of assuming that ultimately the 
automation will boost the technical communities' interest and ability to trial 
things locally (right now the barriers are somewhat prohibitive) and that the 
energy savings will roughly net out - ironically, if it attracts more people, 
they'll in the aggregate consume more energy, but they'll also be vastly more 
efficient energy-wise because they won't have to ETL, which takes a lot of 
compute resources. For potential reusers (e.g., Enterprise or other 
institutions) it might help smooth things along a bit, although this is mostly 
just my conjecture.
  
  Thinking ahead a little, we'd probably want to generalize anything so that it 
can take arbitrary `.jnl`s, for example for split graphs.


2023-10-06 Thread Gehel
Gehel moved this task from Incoming to In Progress on the Data-Platform-SRE 
board.
Gehel assigned this task to bking.

WORKBOARD
  https://phabricator.wikimedia.org/project/board/6524/

2023-10-05 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Addressing @Addshore's comment in T344905#9210122 
...
  
  > I think the amount of time taken to decompress the JNL file should also be 
taken into consideration on varying hardware if compression is being considered.
  
  Here's what I saw for performance:
  
/mnt/x $ time unzstd --output-dir-flat /mnt/y/ wikidata.jnl.zst
wikidata.jnl.zst: 1265888788480 bytes

real    219m10.733s
user    29m51.350s
sys     12m53.425s
  
  This was on an i7-8700 CPU @ 3.20GHz. When I checked with `top`, it seemed to 
be using about 0.8-1.6 cores at any given time, hovering around 1. `unzstd` 
doesn't support multi-threaded decompression from what I can see.
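  (If nothing else, the single-threaded decompression can at least be watched; 
a sketch assuming `pv` is installed:)
  
pv wikidata.jnl.zst | zstd -d > /mnt/y/wikidata.jnl   # same result, plus a progress bar/ETA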


2023-10-05 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Drawing from your inspiration, I downloaded with `wget` overnight, and the 
`sha1sum` now matches that from `wdqs1016`. Decompressing now; will update with 
results.


2023-10-04 Thread bking
bking added a comment.


  Checksum does not match the version from `wdqs1016`, which is:
  
sha1sum wikidata.jnl.zst
e3197eb5177dcd1aa0956824cd8dc4afc2d8796c  wikidata.jnl.zst
  
  I also downloaded the file locally after putting it up in Cloudflare, which 
has a different checksum as well (`shasum` is a Mac utility which defaults to 
sha1 output):
  
shasum wikidata.jnl.zst
d9b3d3729a9a2dce3242e756807411f945cfd824  wikidata.jnl.zst
  
  And I'm also getting `wikidata.jnl.zst : Decoding error (36) : Data 
corruption detected`. Will try redownloading with `wget` and hope for better 
results.
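  (For a quick pass/fail rather than eyeballing digests - using the `wdqs1016` 
digest above:)
  
echo "e3197eb5177dcd1aa0956824cd8dc4afc2d8796c  wikidata.jnl.zst" | sha1sum -c -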


2023-10-03 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Here's the `sha1sum` for the latest file I had downloaded:
  
/mnt/x$ time sha1sum wikidata.jnl.zst
62327feb2c6ad5b352b5abfe9f0a4d3ccbeebfab  wikidata.jnl.zst

real    77m16.215s
user    8m39.726s
sys     2m42.932s


2023-10-03 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  For me the first 300 GB of the file went really, really fast. But `axel` was 
dropping connections, similar to when I had downloaded the large 1 TB file. So 
this download took about 5 hours. I'm pretty sure it could be done in 1-3 
hours, though, if everything were working well.
  
  Now, I encountered an error, and it was reproducible across two separate 
downloads. @bking, does a test on the file yield the same "corrupted block 
detected" warning for you, by any chance, if you download the zst? What about 
with your already-existing copy?
  
/mnt/x $ time unzstd --output-dir-flat /mnt/y/ wikidata.jnl.zst
wikidata.jnl.zst : 649266 MB... wikidata.jnl.zst : Decoding error 
(36) : Corrupted block detected

real    124m59.115s
user    17m44.647s
sys     7m24.509s

/mnt/x $ ls -l wikidata.jnl.zst
-rwxrwxrwx 1 adam adam 342189138219 Oct  3 02:32 wikidata.jnl.zst
  
  I've kicked off a `sha1sum`, but this will take a while to run.
  
/mnt/x$ time sha1sum wikidata.jnl.zst
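  (The "test on the file" mentioned above can be done without writing out the 
full journal, via zstd's built-in integrity check:)
  
/mnt/x $ zstd -t wikidata.jnl.zst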


2023-10-02 Thread bking
bking added a comment.


  A few notes on this process:
  
  - I used zstd compression to compress the JNL file, as it supposedly offers 
the best speed. I used `zstd -T0 -19 wikidata.jnl` as my compression command 
(all cores, maximum compression), with `wdqs1016` as the host. Despite the 
host having 32 cores, I never saw load average go past 16 during the 
compression process, which took ~12 hours.
  - I uploaded to Cloudflare R2 object storage using @Addshore's rclone command 
as described in his blog post. The upload process (with the command 
optimizations in the linked post) took about an hour.
  - I'm downloading the file using axel over my 1Gbps connection with 16 
concurrent connections (`axel -a -n 16`), getting ~320Mbps transfer speed. A 
sketch of the three steps together is below.
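  (The remote, bucket, and URL here are placeholders, and the rclone flags 
merely stand in for the optimizations from the linked post:)
  
zstd -T0 -19 wikidata.jnl                                # on wdqs1016; ~12 hours observed
rclone copy wikidata.jnl.zst r2:wdqs-journal/ --s3-chunk-size 256M --s3-upload-concurrency 16
axel -a -n 16 https://example.r2.dev/wikidata.jnl.zst    # client side, 16 connections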


2023-09-29 Thread Stashbot
Stashbot added a comment.


  Mentioned in SAL (#wikimedia-operations) [2023-09-29T16:22:46Z]  
bking@wdqs1016 depooling to compress JNL file T347605 



2023-09-28 Thread bking
bking updated the task description.


2023-09-28 Thread Maintenance_bot
Maintenance_bot added a project: Wikidata.


2023-09-28 Thread bking
bking updated the task description.


2023-09-28 Thread bking
bking updated the task description.


2023-09-28 Thread bking
bking created this task.
bking added projects: Wikidata-Query-Service, Data-Platform-SRE.
Restricted Application added a subscriber: Aklapper.

TASK DESCRIPTION
  Blazegraph (the application that serves WDQS) stores all its data in a single 
JNL file. The WDQS file is very large (~1.2TB), so moving it on and off the 
hosts tends to be difficult (see T344732 and this blog post).
  
  We've had to do this more than once, and my general rule is that if you have 
to do something more than twice, you need to automate it.
  
  Creating this ticket to:
  
  - Document the process of extracting a JNL file from a wdqs host
  - Solicit feedback from co-workers/community members, and make a decision on 
whether to automate this process. Note that this **does not** mean we'll 
constantly run this process like we do for the TTL dumps - just that we'll 
have a ready-made script (see the sketch below) that starts with a JNL file on 
a wdqs server and ends with a file that can be publicly downloaded.
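  A skeleton of what such a script might look like, stitched together from the 
steps discussed elsewhere in this thread; the journal path, bucket, and 
depool/pool wrappers are assumptions rather than a finished design:
  
#!/bin/bash
# Hypothetical end-to-end sketch: depool, checksum, compress, publish, repool.
set -euo pipefail
JNL=/srv/wdqs/wikidata.jnl                   # placeholder path on the wdqs host
OUT=wikidata-$(date -u +%Y%m%d).jnl.zst

sudo depool                                  # take the host out of rotation first
date -u +%FT%TZ > "$OUT.meta"                # timestamp the journal was taken from
sha1sum "$JNL" >> "$OUT.meta"                # checksum before any copying
zstd -T0 -19 "$JNL" -o "$OUT"                # ~12 hours at max level in practice
sha1sum "$OUT" >> "$OUT.meta"
sudo pool                                    # back into rotation once the read is done

rclone copy "$OUT" r2:wdqs-journal/          # remote/bucket are placeholders
rclone copy "$OUT.meta" r2:wdqs-journal/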
