[Wikidata-bugs] [Maniphest] T215413: Image Classification Research and Development

2024-05-16 Thread dr0ptp4kt
dr0ptp4kt removed a project: Reading-Admin.

TASK DETAIL
  https://phabricator.wikimedia.org/T215413


To: Miriam, dr0ptp4kt
Cc: dr0ptp4kt, fkaelin, AikoChou, Capankajsmilyo, Mholloway, Ottomata, Jheald, 
Cirdan, MoritzMuehlenhoff, CDanis, akosiaris, SandraF_WMF, Fuzheado, 
PDrouin-WMF, Krenair, d.astrikov, JoeWalsh, Nirzar, dcausse, fgiunchedi, 
JAllemandou, leila, Capt_Swing, mpopov, Nuria, DarTar, Halfak, Gilles, 
EBernhardson, MusikAnimal, Abit, elukey, diego, Cparle, Ramsey-WMF, Miriam, 
Isaac, me, Danny_Benjafield_WMDE, Mohamed-Awnallah, S8321414, KinneretG, 
Astuthiodit_1, YLiou_WMF, BeautifulBold, EChetty, lbowmaker, Suran38, BTullis, 
karapayneWMDE, Invadibot, GFontenelle_WMF, Ywats0ns, maantietaja, FRomeo_WMF, 
Peteosx1x, NavinRizwi, ItamarWMDE, Nintendofan885, Akuckartz, Dringsim, 
4748kitoko, Nandana, JKSTNK, Akovalyov, Abdeaitali, Lahi, Gq86, E1presidente, 
GoranSMilovanovic, QZanden, EBjune, KimKelting, Tramullas, Acer, V4switch, 
LawExplorer, Salgo60, Avner, Silverfish, _jensen, rosalieper, Scott_WUaS, 
Susannaanas, Wong128hk, Jane023, terrrydactyl, Wikidata-bugs, Base, 
matthiasmullie, aude, Daniel_Mietchen, Dinoguy1000, Ricordisamoa, Wesalius, 
Lydia_Pintscher, Fabrice_Florin, Raymond, Steinsplitter, Matanya, Mbch331, 
jeremyb


[Wikidata-bugs] [Maniphest] T123349: EPIC: Article placeholders using wikidata

2024-05-16 Thread dr0ptp4kt
dr0ptp4kt removed a project: Reading-Admin.

TASK DETAIL
  https://phabricator.wikimedia.org/T123349


To: dr0ptp4kt
Cc: waldyrious, Lydia_Pintscher, Nasirkhan, Aklapper, StudiesWorld, Lucie, 
atgo, dr0ptp4kt, JKatzWMF, me, BeautifulBold, Suran38, Peteosx1x, NavinRizwi, 
cmadeo, SBisson, Wikidata-bugs, Dinoguy1000, jayvdb, Ricordisamoa


[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-05-13 Thread dr0ptp4kt
dr0ptp4kt closed this task as "Resolved".
dr0ptp4kt added a comment.


  I just added a link to 
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update#See_also
 . Marking this ticket as resolved after noticing it was still open.

TASK DETAIL
  https://phabricator.wikimedia.org/T355037


To: dr0ptp4kt
Cc: Daniel_Mietchen, AndrewTavis_WMDE, dr0ptp4kt, dcausse, Aklapper, 
Danny_Benjafield_WMDE, S8321414, Astuthiodit_1, karapayneWMDE, Invadibot, 
maantietaja, ItamarWMDE, Akuckartz, Dringsim, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331


[Wikidata-bugs] [Maniphest] T352538: [EPIC] Evaluate the impact of the graph split

2024-05-13 Thread dr0ptp4kt
dr0ptp4kt closed subtask T355037: Compare the performance of sparql queries 
between the full graph and the subgraphs as Resolved.

TASK DETAIL
  https://phabricator.wikimedia.org/T352538


To: dr0ptp4kt
Cc: Daniel_Mietchen, Aklapper, Gehel, me, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, AWesterinen, BeautifulBold, Suran38, karapayneWMDE, Invadibot, 
maantietaja, Peteosx1x, NavinRizwi, ItamarWMDE, Akuckartz, Dringsim, Nandana, 
Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, 
EBjune, KimKelting, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, 
Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Dinoguy1000, 
Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T363721: Show "small logo or icon" as fallback image in search

2024-05-13 Thread dr0ptp4kt
dr0ptp4kt edited projects, added Wikidata; removed Discovery-Search (Current 
work).

TASK DETAIL
  https://phabricator.wikimedia.org/T363721


To: dr0ptp4kt
Cc: Aklapper, ChristianKl, Danny_Benjafield_WMDE, S8321414, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, NavinRizwi, ItamarWMDE, Akuckartz, 
Dringsim, Nandana, Amorymeltzer, Lahi, Gq86, GoranSMilovanovic, QZanden, 
KimKelting, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, 
Dinoguy1000, Mbch331, Jay8g, EBjune


[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)

2024-05-09 Thread dr0ptp4kt
dr0ptp4kt closed this task as "Resolved".
dr0ptp4kt claimed this task.
dr0ptp4kt added a comment.


  Thanks @RKemper ! These speed gains are welcome news. We should discuss in an 
upcoming meeting whether there are any further actions. I can see how we may 
want to set the bufferCapacity to 1,000,000 for imports, whereas we may want to 
just keep running with a bufferCapacity of 100,000 once a node is in serving 
mode, but that's a good topic for discussion.
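  For anyone trying this locally, a minimal sketch of that toggle, assuming a 
locally editable RWStore.properties at a stand-in path (in production the file 
is puppet-managed, so this is illustrative only):
  
# Hypothetical path; in production RWStore.properties comes from puppet templates.
PROPS=/srv/wdqs/RWStore.properties

# Before a bulk import: raise the buffer capacity to 1,000,000.
sudo sed -i 's/^com\.bigdata\.rdf\.sail\.bufferCapacity=.*/com.bigdata.rdf.sail.bufferCapacity=1000000/' "$PROPS"

# Once the node goes back into serving mode: return to the default of 100,000.
sudo sed -i 's/^com\.bigdata\.rdf\.sail\.bufferCapacity=.*/com.bigdata.rdf.sail.bufferCapacity=100000/' "$PROPS"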

TASK DETAIL
  https://phabricator.wikimedia.org/T362920


To: dr0ptp4kt
Cc: dcausse, RKemper, bking, Aklapper, dr0ptp4kt, Danny_Benjafield_WMDE, 
S8321414, Astuthiodit_1, AWesterinen, karapayneWMDE, Invadibot, maantietaja, 
ItamarWMDE, Akuckartz, Dringsim, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, KimKelting, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)

2024-05-09 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Mirroring comment in T359062#9783010 
<https://phabricator.wikimedia.org/T359062#9783010>:
  
  > And for the second run in T362920: Benchmark Blazegraph import with 
increased buffer capacity (and other factors) 
<https://phabricator.wikimedia.org/T362920> we saw that this took about 3089 
minutes, or about 2.15 days, for the scholarly article entity graph with 
the CPU governor change (described in T336443#9726600 
<https://phabricator.wikimedia.org/T336443#9726600> ) plus the bufferCapacity 
at 1,000,000 on wdqs2023.

TASK DETAIL
  https://phabricator.wikimedia.org/T362920


To: dr0ptp4kt
Cc: dcausse, RKemper, bking, Aklapper, dr0ptp4kt, Danny_Benjafield_WMDE, 
S8321414, Astuthiodit_1, AWesterinen, karapayneWMDE, Invadibot, maantietaja, 
ItamarWMDE, Akuckartz, Dringsim, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, KimKelting, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-05-09 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  On the gaming-class 2018 desktop, although the `bufferCapacity` value of 
1,000,000 sped things up as described on this ticket, applying the CPU governor 
change did not seem to have any additional bearing (the import took 2.47 days, 
compared to the previous best of 2.44). It's possible that the desktop's 
existing BIOS configuration (already set to a high-performance mode) was 
already squeezing out optimal performance, or that something about the 
processor architecture's interaction with the rest of the hardware and 
operating system simply differs from the data center server. In any case, it's 
nice to see that the data center server is faster!
  
  One theory is that the gaming-class desktop's 64GB of total RAM plays some 
role; the hardware provider has indicated that although more memory can be 
installed, the machine will only run with 64GB RAM and can't jump to 128GB RAM. 
Another is that the default memory swappiness (60) on the desktop could play a 
role. I find this less likely, as memory spikes haven't seemed to be a problem 
on this machine while loading data, and the drive is an NVMe, so paging is 
somewhat less likely to manifest problematically anyway. Maybe something to 
check another day; we use a swappiness of 0 in the data center generally, as 
with the WDQS hosts.
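  For the record, a quick way to check and temporarily change swappiness if we 
ever do test that on the desktop (standard Linux commands; the value of 0 
mirrors the data center setting mentioned above):
  
# Show the current swappiness (reported as 60 on the desktop).
cat /proc/sys/vm/swappiness

# Temporarily set it to 0, as on the WDQS hosts; this resets on reboot.
sudo sysctl -w vm.swappiness=0

# Optionally persist it across reboots.
echo 'vm.swappiness=0' | sudo tee /etc/sysctl.d/99-swappiness.conf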

TASK DETAIL
  https://phabricator.wikimedia.org/T359062


To: dr0ptp4kt
Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Dringsim, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-05-09 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  And for the second run in T362920: Benchmark Blazegraph import with increased 
buffer capacity (and other factors) <https://phabricator.wikimedia.org/T362920> 
we saw that this took about 3089 minutes, or about 2.15 days, for the 
scholarly article entity graph with the CPU governor change (described in 
T336443#9726600 <https://phabricator.wikimedia.org/T336443#9726600> ) plus the 
bufferCapacity at 1,000,000 on wdqs2023.

TASK DETAIL
  https://phabricator.wikimedia.org/T359062


To: dr0ptp4kt
Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Dringsim, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331


[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)

2024-05-07 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  In T362920#9776418 <https://phabricator.wikimedia.org/T362920#9776418>, 
@RKemper wrote:
  
  > @dr0ptp4kt
  >
  >> we saw that this took about 3702 minutes, or about 2.57 //hours//
  >
  > Typo you'll want to fix here and in the original: 2.57 **days**
  
  I think this is what is referred to as wishful thinking! Okay, I updated the 
comment in the other ticket and the comment up above.

TASK DETAIL
  https://phabricator.wikimedia.org/T362920


To: dr0ptp4kt
Cc: dcausse, RKemper, bking, Aklapper, dr0ptp4kt, Danny_Benjafield_WMDE, 
S8321414, Astuthiodit_1, AWesterinen, karapayneWMDE, Invadibot, maantietaja, 
ItamarWMDE, Akuckartz, Dringsim, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, KimKelting, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)

2024-05-06 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Mirroring comment in T359062#9775908 
<https://phabricator.wikimedia.org/T359062#9775908>:
  
  > In T362920 <https://phabricator.wikimedia.org/T362920>: Benchmark 
Blazegraph import with increased buffer capacity (and other factors) we saw 
that this took about 3702 minutes, or about 2.57 hours, for the scholarly 
article entity with the CPU governor change (described in T336443#9726600 
<https://phabricator.wikimedia.org/T336443#9726600> ) alone on wdqs2023.
  
  The count matches T359062#9695544 
<https://phabricator.wikimedia.org/T359062#9695544>.
  
select (count(*) as ?ct)
where {?s ?p ?o}

7643858078
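  For completeness, the count can be reproduced against the local Blazegraph 
instance with something like the following; the endpoint path assumes the usual 
local WDQS setup (namespace `wdq` on port 9999), so adjust as needed:

curl -s http://localhost:9999/bigdata/namespace/wdq/sparql \
  --data-urlencode 'query=SELECT (COUNT(*) AS ?ct) WHERE { ?s ?p ?o }' \
  -H 'Accept: text/csv'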

TASK DETAIL
  https://phabricator.wikimedia.org/T362920


To: dr0ptp4kt
Cc: dcausse, RKemper, bking, Aklapper, dr0ptp4kt, Danny_Benjafield_WMDE, 
S8321414, Astuthiodit_1, AWesterinen, karapayneWMDE, Invadibot, maantietaja, 
ItamarWMDE, Akuckartz, Dringsim, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, KimKelting, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-05-06 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  In T362920: Benchmark Blazegraph import with increased buffer capacity (and 
other factors) <https://phabricator.wikimedia.org/T362920> we saw that this 
took about 3702 minutes, or about 2.57 hours, for the scholarly article entity 
with the CPU governor change (described in T336443#9726600 
<https://phabricator.wikimedia.org/T336443#9726600> ) alone on wdqs2023.

TASK DETAIL
  https://phabricator.wikimedia.org/T359062


To: dr0ptp4kt
Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Dringsim, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331


[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)

2024-05-02 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Another thing that can be helpful for figuring things out later is to add some 
timing and a simple log file. A command like the following was helpful when I 
was trying this out on the gaming-class desktop (you may not need this if your 
tmux session lets you scroll back far enough, but it's handy for tailing even 
without tmux).
  
date | tee loadData.log; time ./loadData.sh -n wdq -d /mnt/firehose/split_0/nt_wd_schol -s 0 -e 0 2>&1 | tee -a loadData.log; time ./loadData.sh -n wdq -d /mnt/firehose/split_0/nt_wd_schol 2>&1 | tee -a loadData.log
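  Once the log exists, the per-commit timings can be pulled back out for a quick 
look at trends; a rough sketch, assuming the log format shown in the snippets on 
this ticket:

# Per-file commit durations (milliseconds), in order of processing.
grep -o 'COMMIT: totalElapsed=[0-9]*' loadData.log | cut -d= -f2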

TASK DETAIL
  https://phabricator.wikimedia.org/T362920


To: dr0ptp4kt
Cc: dcausse, RKemper, bking, Aklapper, dr0ptp4kt, Danny_Benjafield_WMDE, 
Isabelladantes1983, Themindcoder, Adamm71, S8321414, Jersione, Hellket777, 
LisafBia6531, Astuthiodit_1, AWesterinen, 786, Biggs657, karapayneWMDE, 
Invadibot, maantietaja, Juan90264, Alter-paule, Beast1978, ItamarWMDE, Un1tY, 
Akuckartz, Dringsim, Hook696, Kent7301, joker88john, CucyNoiD, Nandana, 
Namenlos314, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, KimKelting, merbst, 
LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Neuronton, Scott_WUaS, 
Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, 
Mbch331


[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)

2024-05-02 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  @RKemper I think that's captured in P54284 
<https://phabricator.wikimedia.org/P54284> . If you need to get a copy of the 
files, there's a pointer in T350106#9381611 
<https://phabricator.wikimedia.org/T350106#9381611> on how one might go about 
copying from HDFS to the local filesystem, and the rest of that ticket covers 
the data transfer. I kept a copy of the files at 
`stat1006:/home/dr0ptp4kt/gzips/nt_wd_schol`, so those should be ready to be 
copied over if that helps at all.

TASK DETAIL
  https://phabricator.wikimedia.org/T362920


To: dr0ptp4kt
Cc: dcausse, RKemper, bking, Aklapper, dr0ptp4kt, Danny_Benjafield_WMDE, 
Isabelladantes1983, Themindcoder, Adamm71, S8321414, Jersione, Hellket777, 
LisafBia6531, Astuthiodit_1, AWesterinen, 786, Biggs657, karapayneWMDE, 
Invadibot, maantietaja, Juan90264, Alter-paule, Beast1978, ItamarWMDE, Un1tY, 
Akuckartz, Dringsim, Hook696, Kent7301, joker88john, CucyNoiD, Nandana, 
Namenlos314, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, KimKelting, merbst, 
LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Neuronton, Scott_WUaS, 
Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, 
Mbch331


[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)

2024-04-18 Thread dr0ptp4kt
dr0ptp4kt added a project: Wikidata.

TASK DETAIL
  https://phabricator.wikimedia.org/T362920


To: dr0ptp4kt
Cc: Aklapper, dr0ptp4kt, Danny_Benjafield_WMDE, S8321414, Astuthiodit_1, 
AWesterinen, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, 
QZanden, EBjune, KimKelting, merbst, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)

2024-04-18 Thread dr0ptp4kt
dr0ptp4kt renamed this task from "Benchmark Blazegraph import with increased 
buffer capacity" to "Benchmark Blazegraph import with increased buffer capacity 
(and other factors)".

TASK DETAIL
  https://phabricator.wikimedia.org/T362920


To: dr0ptp4kt
Cc: Aklapper, dr0ptp4kt, AWesterinen, Namenlos314, Gq86, 
Lucas_Werkmeister_WMDE, EBjune, KimKelting, merbst, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles


[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity

2024-04-18 Thread dr0ptp4kt
dr0ptp4kt created this task.
dr0ptp4kt added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.

TASK DESCRIPTION
  In T359062: Assess Wikidata dump import hardware 
<https://phabricator.wikimedia.org/T359062> there's compelling evidence that 
increasing buffer capacity for import, that is to say updating 
RWStore.properties 
<https://gerrit.wikimedia.org/g/operations/puppet/+/3038e3b156c743c986d2f032a9810272138da9e2/modules/query_service/templates/RWStore.common.properties.erb#26>
 for a value of `com.bigdata.rdf.sail.bufferCapacity=100`, leads to a 
material performance improvement, as observed on a gaming-class desktop.
  
  This task is to request that we soon verify on a WDQS node in the data 
center, preferably ahead of any further imports with changed graph split 
definitions.
  
  At this point it seems clear that CPU speed, disk speed, and the buffer 
capacity make a meaningful difference in import time.
  
  Proposed:
  
  Using the `scholarly_articles` split files, on wdqs2024, run imports as 
follows.
  
  1. With the CPU performance governor configuration applied as described in 
T336443#9726600 <https://phabricator.wikimedia.org/T336443#9726600> and with 
the existing default `RWStore.properties` configuration (which will have 
`com.bigdata.rdf.sail.bufferCapacity=10`, note this is 100_000). This will 
let us better understand for the R450 
<https://phabricator.wikimedia.org/diffusion/EPRO/> setup if the performance 
benefits for the performance governor configuration (sort of an analog of a 
faster processor like what we've seen with a gaming-class desktop) extend to 
this bulk ingestion routine. We could compare against results from 
T350465#9405888 <https://phabricator.wikimedia.org/T350465#9405888> .
  2. Then, still with the CPU performance governor configuration in place, 
using a RWStore.properties with a value of 
`com.bigdata.rdf.sail.bufferCapacity=100` (note this is 1_000_000). This 
will let us verify that for this hardware class the performance benefits are 
further extended.
  3. If and when a high speed NVMe is installed onto wdqs2024 (T361216), with 
both the CPU performance governor and higher buffer capacity pieces in place. 
This will let us verify that for this hardware class the performance benefits 
are even further extended.
  
  We had used wdqs**1**02**4** for the main graph ("non-scholarly") import 
before, and note the request here is to do the scholarly article graph import 
on wdqs202**4**. This is mainly because we have an NVMe request in flight for 
it.
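  
  The exact governor change is described in T336443#9726600 and isn't 
reproduced here, but as a generic sketch (an assumption about the procedure, 
not necessarily what is applied on the WDQS hosts), the performance governor 
can be inspected and set like so:
  
# Inspect the current governor on each CPU.
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

# Switch all CPUs to the performance governor (requires the cpupower tool).
sudo cpupower frequency-set -g performance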

TASK DETAIL
  https://phabricator.wikimedia.org/T362920


To: dr0ptp4kt
Cc: Aklapper, dr0ptp4kt, AWesterinen, Namenlos314, Gq86, 
Lucas_Werkmeister_WMDE, EBjune, KimKelting, merbst, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles


[Wikidata-bugs] [Maniphest] T362060: Generalize ScholarlyArticleSplitter

2024-04-16 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  **Running time**
  Total Uptime: 55 min
  
  This was faster than in T347989#9335980 
<https://phabricator.wikimedia.org/T347989#9335980>. Nice!
  
  **Counts**
  
  To be discussed in code review.
  
  **Samples**
  
  These look about like what we'd expect based on T347989#9346038 
<https://phabricator.wikimedia.org/T347989#9346038> .
  
select "| " || concat_ws(" | ", subject, predicate, object, context)
from dr0ptp4kt.wikibase_rdf_scholarly_split_t362060
where snapshot = '20231016' and wiki = 'wikidata'
  and scope = 'scholarly_articles' and rand() <= (30/7643858365)
distribute by rand() sort by rand() limit 30;
  
  {icon graduation-cap}
  
  | subject | predicate | object | context |
  | --- | --- | --- | --- |
  | http://www.wikidata.org/entity/statement/Q46815762-E3F8B9BE-32CC-4055-9097-0732A1D7E88E | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://wikiba.se/ontology#BestRank | http://www.wikidata.org/entity/Q46815762 |
  | http://www.wikidata.org/reference/c2c805e274b6709d71ffd08402ed14a95ddc0f48 | http://www.wikidata.org/prop/reference/P248 | http://www.wikidata.org/entity/Q180686 | http://wikiba.se/ontology#Reference |
  | http://www.wikidata.org/entity/Q93646519 | http://schema.org/description | "1985\u5E74\u306E\u8AD6\u6587"@ja | http://www.wikidata.org/entity/Q93646519 |
  | http://www.wikidata.org/entity/Q82929879 | http://wikiba.se/ontology#sitelinks | "0"^^http://www.w3.org/2001/XMLSchema#integer | http://www.wikidata.org/entity/Q82929879 |
  | http://www.wikidata.org/reference/698fdc9c32c9033280837148dd0cc2fbb09701b6 | http://www.wikidata.org/prop/reference/P248 | http://www.wikidata.org/entity/Q229883 | http://wikiba.se/ontology#Reference |
  | http://www.wikidata.org/entity/statement/Q37398018-08548343-257C-43E8-8768-1B82B012B857 | http://www.w3.org/ns/prov#wasDerivedFrom | http://www.wikidata.org/reference/1312ec06258ac7841e5e97d5b1d85cc034da666b | http://www.wikidata.org/entity/Q37398018 |
  | http://www.wikidata.org/entity/statement/Q38261165-38825DC4-B1CA-4102-8CCE-2B4713882EED | http://wikiba.se/ontology#rank | http://wikiba.se/ontology#NormalRank | http://www.wikidata.org/entity/Q38261165 |
  | http://www.wikidata.org/entity/statement/Q50247650-2B75A590-C865-4CD7-8E93-C5720E77B459 | http://www.wikidata.org/prop/statement/P31 | http://www.wikidata.org/entity/Q13442814 | http://www.wikidata.org/entity/Q50247650 |
  | http://www.wikidata.org/entity/statement/Q56638632-3EEB814A-C402-48D4-9577-B91996287EDD | http://wikiba.se/ontology#rank | http://wikiba.se/ontology#NormalRank | http://www.wikidata.org/entity/Q56638632 |
  | http://www.wikidata.org/entity/statement/Q93198245-A9EF6F3A-AE60-4B68-9ADF-03861F92E7D2 | http://www.w3.org/ns/prov#wasDerivedFrom | http://www.wikidata.org/reference/c40456cccbdf1b0dbf4590fad9ace45a270e3af6 | http://www.wikidata.org/entity/Q93198245 |
  | http://www.wikidata.org/entity/statement/Q35798201-73FA43B1-DE81-4AB8-84A1-435A776AFBF8 | http://www.wikidata.org/prop/statement/P50 | http://www.wikidata.org/entity/Q55071316 | http://www.wikidata.org/entity/Q35798201 |
  | http://www.wikidata.org/entity/statement/Q46675214-E205C68E-FD35-4F3B-99F6-CEF31C772C1E | http://www.wikidata.org/prop/qualifier/P1545 | "2" | http://www.wikidata.org/entity/Q46675214 |
  | http://www.wikidata.org/entity/statement/Q40608211-C59EE5EA-2F96-47C2-AE41-7EBEB83583F5 | http://wikiba.se/ontology#rank | http://wikiba.se/ontology#NormalRank | http://www.wikidata.org/entity/Q40608211 |
  | http://ww

[Wikidata-bugs] [Maniphest] T362060: Generalize ScholarlyArticleSplitter

2024-04-16 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  I kicked off a run using the current version of the patch, with the following 
command and backing table; its status can be followed here: 
https://yarn.wikimedia.org/cluster/app/application_1713178047802_16409
  
  So long as I haven't made an error somewhere in here that produces a runtime 
exception (e.g., pathing), we should be able to see how it's going after a 
couple of hours.
  
spark3-submit --master yarn --driver-cores 2 \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  --conf spark.dynamicAllocation.maxExecutors=128 \
  --conf spark.sql.shuffle.partitions=512 \
  --conf spark.executor.memoryOverhead=4g \
  --executor-cores 4 --executor-memory 12g --driver-memory 16g \
  --name scholarly_article_split_manual__scholarly_article_split_triples__T362060_personal_namespace \
  --conf spark.yarn.maxAppAttempts=1 \
  --class org.wikidata.query.rdf.spark.transform.structureddata.dumps.ScholarlyArticleSplit \
  --deploy-mode cluster \
  /home/dr0ptp4kt/rdf-spark-tools-0.3.138-SNAPSHOT-jar-with-dependencies-T362060.jar \
  --input-table-partition-spec discovery.wikibase_rdf_t337013/date=20231016/wiki=wikidata \
  --output-table-partition-spec dr0ptp4kt.wikibase_rdf_scholarly_split_T362060/snapshot=20231016/wiki=wikidata
  
  Here was the manual table creation I did while `use`ing the `dr0ptp4kt` 
namespace.
  
CREATE TABLE IF NOT EXISTS dr0ptp4kt.wikibase_rdf_scholarly_split_T362060 (
  `subject` string,
  `predicate` string,
  `object` string,
  `context` string
)
PARTITIONED BY (
`snapshot` string,
`wiki` string,
`scope` string
)
STORED AS PARQUET
LOCATION 
'hdfs://analytics-hadoop/user/dr0ptp4kt/wikibase_rdf_scholarly_split_T362060/wikidata/rdf_scholarly_split_T362060/'
;
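  
  Once the job finishes, a quick sanity check of the output partitions could 
look like the following, assuming the `spark3-sql` client wrapper is available 
alongside `spark3-submit` (if not, the same statements work from whichever 
Hive/Spark SQL client is at hand):
  
# Show which (snapshot, wiki, scope) partitions were written.
spark3-sql -e "SHOW PARTITIONS dr0ptp4kt.wikibase_rdf_scholarly_split_T362060;"

# Rough row counts per scope for the snapshot in question.
spark3-sql -e "SELECT scope, COUNT(*) AS triples
               FROM dr0ptp4kt.wikibase_rdf_scholarly_split_T362060
               WHERE snapshot = '20231016' AND wiki = 'wikidata'
               GROUP BY scope;"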

TASK DETAIL
  https://phabricator.wikimedia.org/T362060


To: dr0ptp4kt
Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Isabelladantes1983, 
Themindcoder, Adamm71, S8321414, Jersione, Hellket777, LisafBia6531, 
Astuthiodit_1, 786, Biggs657, karapayneWMDE, Invadibot, maantietaja, Juan90264, 
Alter-paule, Beast1978, ItamarWMDE, Un1tY, Akuckartz, Hook696, Kent7301, 
joker88john, CucyNoiD, Nandana, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, 
Af420, Bsandipan, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, 
Lewizho99, Maathavan, _jensen, rosalieper, Neuronton, Scott_WUaS, 
Wikidata-bugs, aude, Mbch331


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-04-10 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Good news. With the N-triples style scholarly entity graph files, a buffer 
capacity of 1,000,000, a write retention queue capacity of 4000, and a heap 
size of 31g, the import on the gaming-class desktop took about 2.40 days. 
Recall that with a buffer capacity of 100,000 it took about 3.25 days on this 
desktop (and again, recall that it was 5.875 days on wdqs1024). So the higher 
buffer capacity gave roughly a 35% speed increase here (3.25 / 2.40 = 1.35, 
minus 1) on this gaming-class desktop.
  
  It appears, then, that the combination of faster CPU, NVMe, and a higher 
buffer capacity is somewhere around 144% faster (5.875 / 2.40 = 2.44, 2.44 
minus 1 = 1.44) than what we observed on a target data center machine.
  
  It will likely be somewhat less dramatic on 10B triples if the previous 
munged file runs are any clue. I'm going to think about how to check this 
notion - it could be done by using the scholarly graph plus a portion of the 
main graph, which would probably be close enough for our purposes.
  
  A high speed NVMe is in the process of being acquired so that we can verify 
on wdqs2024 the level of speedup achievable on a server similar to the graph 
split test servers; wdqs2024 currently has a hardware profile similar to 
wdqs1024.
  
  Some stuff from the terminal from the import on the gaming-class desktop:
  
ubuntu22:~$ head -9 ~/rdf/dist/target/service-0.3.138-SNAPSHOT/loadData.log
Sun Apr  7 12:03:19 PM CDT 2024
Processing part-00000-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz
totalElapsed=64069ms, elapsed=64024ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=71897ms, commitTime=1712509470732, mutationCount=7349689
Sun Apr  7 12:04:31 PM CDT 2024
Processing part-00001-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz

# screen output at the end:

Processing part-01023-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz
totalElapsed=51703ms, elapsed=51703ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=181013ms, commitTime=1712716306763, mutationCount=7946575
Tue Apr  9 09:31:50 PM CDT 2024
File /mnt/firehose/split_0/nt_wd_schol/part-01024-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz not found, terminating

real    3447m18.542s

TASK DETAIL
  https://phabricator.wikimedia.org/T359062


To: dr0ptp4kt
Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-04-08 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Update: With the buffer capacity at 1,000,000, file number 550 of the 
scholarly graph was imported as of `Mon Apr  8 03:22:08 PM CDT 2024`. So, 
under 28 hours so far (with the buffer capacity at 100,000 it was more than 36 
hours).
  
Processing part-00550-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz
totalElapsed=51018ms, elapsed=51018ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=245278ms, commitTime=1712607725882, mutationCount=7414497
Mon Apr  8 03:22:08 PM CDT 2024
  
  Will update when it completes.
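  
  In the meantime, a cheap way to check progress from the `tee`'d log without 
scrolling the screen/tmux session, assuming the log format shown above and 
running from the service directory:
  
# How many files have been loaded so far, and which one is in flight.
grep -c '^Processing ' loadData.log
grep '^Processing ' loadData.log | tail -n 1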

TASK DETAIL
  https://phabricator.wikimedia.org/T359062


To: dr0ptp4kt
Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331


[Wikidata-bugs] [Maniphest] T361246: scap deploy should not repool a wdqs node that is depooled

2024-04-08 Thread dr0ptp4kt
dr0ptp4kt added a project: Discovery-Search (Current work).

TASK DETAIL
  https://phabricator.wikimedia.org/T361246


To: dr0ptp4kt
Cc: dcausse, Aklapper, Danny_Benjafield_WMDE, S8321414, Astuthiodit_1, 
AWesterinen, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, 
QZanden, EBjune, KimKelting, merbst, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T361935: Adapt the WDQS Streaming Updater to update multiple WDQS subgraphs

2024-04-08 Thread dr0ptp4kt
dr0ptp4kt added a project: Discovery-Search (Current work).

TASK DETAIL
  https://phabricator.wikimedia.org/T361935


To: dr0ptp4kt
Cc: Daniel_Mietchen, dr0ptp4kt, pfischer, dcausse, Aklapper, 
Danny_Benjafield_WMDE, S8321414, Astuthiodit_1, AWesterinen, karapayneWMDE, 
Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, 
Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, KimKelting, 
merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T361950: Ensure that WDQS query throttling does not interfere with federation

2024-04-08 Thread dr0ptp4kt
dr0ptp4kt added a project: Discovery-Search (Current work).

TASK DETAIL
  https://phabricator.wikimedia.org/T361950


To: dr0ptp4kt
Cc: Daniel_Mietchen, Aklapper, dcausse, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, AWesterinen, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, 
Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, 
GoranSMilovanovic, QZanden, EBjune, KimKelting, merbst, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, 
Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T362060: Generalize ScholarlyArticleSplitter

2024-04-08 Thread dr0ptp4kt
dr0ptp4kt added a project: Discovery-Search (Current work).

TASK DETAIL
  https://phabricator.wikimedia.org/T362060


To: dr0ptp4kt
Cc: dcausse, Aklapper, Danny_Benjafield_WMDE, S8321414, Astuthiodit_1, 
AWesterinen, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, 
QZanden, EBjune, KimKelting, merbst, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T361114: Alert Search Platform and/or DPE SRE when Wikidata is lagged

2024-04-08 Thread dr0ptp4kt
dr0ptp4kt set the point value for this task to "2".

TASK DETAIL
  https://phabricator.wikimedia.org/T361114


To: dr0ptp4kt
Cc: Lucas_Werkmeister_WMDE, dcausse, Aklapper, bking, Danny_Benjafield_WMDE, 
S8321414, Astuthiodit_1, AWesterinen, BTullis, karapayneWMDE, Invadibot, 
maantietaja, ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
GoranSMilovanovic, QZanden, EBjune, KimKelting, merbst, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, 
Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-04-07 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  With bufferCapacity at 1,000,000, I kicked it off again with the scholarly 
article entity graph files:
  
ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ date | tee loadData.log; time ./loadData.sh -n wdq -d /mnt/firehose/split_0/nt_wd_schol -s 0 -e 0 2>&1 | tee -a loadData.log; time ./loadData.sh -n wdq -d /mnt/firehose/split_0/nt_wd_schol 2>&1 | tee -a loadData.log
Sun Apr  7 12:03:19 PM CDT 2024
Processing part-00000-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz

TASK DETAIL
  https://phabricator.wikimedia.org/T359062


To: dr0ptp4kt
Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-04-07 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Update. On the gaming-class machine it took about 3.25 days to import the 
scholarly article entity graph using a buffer capacity of 100,000 (compare this 
with 5.875 days on wdqs1024 
<https://phabricator.wikimedia.org/T350465#9405888>). This resulted in 
7_643_858_078 triples, as expected. Next up will be a run with a buffer 
capacity of 1,000,000 to see if there is any obvious difference in import time.
  
Sun Apr  7 03:34:59 AM CDT 2024
Processing part-01023-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz
totalElapsed=181901ms, elapsed=181901ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=226511ms, commitTime=1712479122009, mutationCount=7946575
Sun Apr  7 03:38:46 AM CDT 2024
File /mnt/firehose/split_0/nt_wd_schol/part-01024-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz not found, terminating

real    4684m49.905s

TASK DETAIL
  https://phabricator.wikimedia.org/T359062


To: dr0ptp4kt
Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-04-05 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Just updating on how far along this run is: file 550 of the scholarly article 
entity side of the graph is being processed. There are files 0 through 1023 for 
this side of the graph. Note that I did think to `tee` the output this time 
around, so there should be more information available for reviewing output, 
stack traces (hopefully there are none), and so on, should it be needed.
  
Processing part-00549-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz
totalElapsed=299675ms, elapsed=299675ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=392531ms, commitTime=1712329890306, mutationCount=7032172
Fri Apr  5 10:11:32 AM CDT 2024
Processing part-00550-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz
  
  Sidebar: the "non"-scholarly article entity graph also has files 0-1023 and 
is similarly sized in terms of triples, but naturally the way nodes are 
interconnected differs because of the types of entities, the kinds of data 
those entities carry, and so on.

TASK DETAIL
  https://phabricator.wikimedia.org/T359062


To: dr0ptp4kt
Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-04-04 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Following roughly the procedure in P54284 
<https://phabricator.wikimedia.org/P54284> to rename the Spark-produced graph 
files (and updating `loadData.sh` with 
`FORMAT=part-%05d-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz` and still 
having a `date` call after each `curl` in it), I kicked off an import of the 
scholarly article entity graph like so, to see how it goes with a buffer 
capacity of 100,000:
  
ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ date; time ./loadData.sh -n wdq -d /mnt/firehose/split_0/nt_wd_schol -s 0 -e 0 2>&1 | tee loadData.log; time ./loadData.sh -n wdq -d /mnt/firehose/split_0/nt_wd_schol 2>&1 | tee -a loadData.log
Wed Apr  3 09:32:54 PM CDT 2024
Processing part-00000-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz
totalElapsed=55629ms, elapsed=55584ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=61598ms, commitTime=1712198035155, mutationCount=7349689
Wed Apr  3 09:33:56 PM CDT 2024

real    1m1.702s
user    0m0.004s
sys     0m0.006s
Processing part-00001-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz
totalElapsed=61251ms, elapsed=61251ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=71925ms, commitTime=1712198106800, mutationCount=7774048
Wed Apr  3 09:35:08 PM CDT 2024
Processing part-00002-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz
  
  This is with the following values in `RWStore.properties`
  
com.bigdata.btree.writeRetentionQueue.capacity=4000
com.bigdata.rdf.sail.bufferCapacity=100000
  
  and the following variable in `loadData.sh`
  
HEAP_SIZE=${HEAP_SIZE:-"31g"}
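  
  As a small pre-flight check before a multi-day run, it can be worth confirming 
those values are actually in effect; paths here are assumed relative to the 
service-0.3.138-SNAPSHOT directory:
  
grep -E 'writeRetentionQueue.capacity|bufferCapacity' RWStore.properties
grep '^HEAP_SIZE=' loadData.sh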

TASK DETAIL
  https://phabricator.wikimedia.org/T359062


To: dr0ptp4kt
Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-04-03 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  This morning, April 3, around 6:25 AM I SSH'd in to check progress. It was 
working but going slowly, similar to the day before. It was on a file number in 
the 1200s, but I didn't write down the number or copy the terminal output; I do 
remember seeing around 796 seconds for one of the files at that time. Looking 
at the previous comment, you'll see those were going slowly too; not 
surprising, as we know imports on these munged files slow down as more data is 
imported.
  
  I checked several hours later in the middle of a meeting, and it had gone 
into a bad spiral.
  
  I've been able to use `screen` backscrolling to recover much of the stack 
trace, but could not scroll back far enough to tell for sure which file was the 
last to import successfully without a stack trace. What we can say is that 
//probably// the last somewhat stable commit was on file 1302 at about 7:24 AM. 
Probably file 1303, and definitely files 1304 and 1305, have been failing badly 
and taking a really long time doing so; this would probably continue 
indefinitely from here without killing the process. Just a slice of the paste 
here to give an idea of things (notice `lastCommitTime` and `commitCounter` in 
the stack trace).
  
Wed Apr  3 02:05:26 PM CDT 2024
Processing wikidump-01305.ttl.gz
SPARQL-UPDATE: updateStr=LOAD 

java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: Problem with entry at -83289912769511002: lastRootBlock=rootBlock{ rootBlock=0, challisField=1302, version=3, nextOffset=47806576684846562, localTime=1712147044389 [Wednesday, April 3, 2024 7:24:04 AM CDT], firstCommitTime=1711737574896 [Friday, March 29, 2024 1:39:34 PM CDT], lastCommitTime=1712147041973 [Wednesday, April 3, 2024 7:24:01 AM CDT], commitCounter=1302, commitRecordAddr={off=NATIVE:-140859033,len=422}, commitRecordIndexAddr={off=NATIVE:-93467508,len=220}, blockSequence=34555, quorumToken=-1, metaBitsAddr=26754033649714513, metaStartAddr=11989126, storeType=RW, uuid=f993598d-497c-46a7-8434-d25c8859a0b8, offsetBits=42, checksum=1600335692, createTime=1711737574192 [Friday, March 29, 2024 1:39:34 PM CDT], closeTime=0}
  
  Unfortunately `jstack` seems to hiccup.
  
ubuntu22:~$ sudo jstack -m 987870
[sudo] password: 
Attaching to process ID 987870, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.402-b06
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.tools.jstack.JStack.runJStackTool(JStack.java:140)
at sun.tools.jstack.JStack.main(JStack.java:106)
Caused by: java.lang.RuntimeException: Unable to deduce type of thread from 
address 0x7fecb400b800 (expected type JavaThread, CompilerThread, 
ServiceThread, JvmtiAgentThread, or SurrogateLockerThread)
at 
sun.jvm.hotspot.runtime.Threads.createJavaThreadWrapper(Threads.java:169)
at sun.jvm.hotspot.runtime.Threads.first(Threads.java:153)
at sun.jvm.hotspot.tools.PStack.initJFrameCache(PStack.java:200)
at sun.jvm.hotspot.tools.PStack.run(PStack.java:71)
at sun.jvm.hotspot.tools.PStack.run(PStack.java:58)
at sun.jvm.hotspot.tools.PStack.run(PStack.java:53)
at sun.jvm.hotspot.tools.JStack.run(JStack.java:66)
at sun.jvm.hotspot.tools.Tool.startInternal(Tool.java:260)
at sun.jvm.hotspot.tools.Tool.start(Tool.java:223)
at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
at sun.jvm.hotspot.tools.JStack.main(JStack.java:92)
... 6 more
Caused by: sun.jvm.hotspot.types.WrongTypeException: No suitable match for 
type of address 0x7fecb400b800
at 
sun.jvm.hotspot.runtime.InstanceConstructor.newWrongTypeException(InstanceConstructor.java:62)
at 
sun.jvm.hotspot.runtime.VirtualConstructor.instantiateWrapperFor(VirtualConstructor.java:80)
at 
sun.jvm.hotspot.runtime.Threads.createJavaThreadWrapper(Threads.java:165)
... 16 more
ubuntu22:~$ sudo jstack -Flm 987870
Usage:
jstack [-l] 
(to connect to running process)
jstack -F [-m] [-l] 
(to connect to a hung process)
jstack [-m] [-l]  
(to connect to a core file)
jstack [-m] [-l] [server_id@]
(to connect to a remote debug server)

Options:
-F  to force a thread dump. Use when jstack  does not respond 
(process is hung)
-m  to print both java and native frames (

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-04-02 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Now this is interesting: we're now past 4 days (about 4 days and 1 hour) of 
this running, and with the buffer capacity at 100,000 instead of 1,000,000 (but 
this time without any gap between the batches of files), there's still a good 
way to go yet.
  
Processing wikidump-01177.ttl.gz
totalElapsed=612796ms, elapsed=612796ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=689208ms, commitTime=1712085811545, mutationCount=12297407
Tue Apr  2 02:23:35 PM CDT 2024
Processing wikidump-01178.ttl.gz
totalElapsed=850122ms, elapsed=850121ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=950693ms, commitTime=1712086762086, mutationCount=16659867
Tue Apr  2 02:39:26 PM CDT 2024
Processing wikidump-01179.ttl.gz
  
  It's possible this means that a higher buffer capacity actually makes a 
difference. I will let this run complete so we can see the percentage 
difference.
  
  After this I will check whether this sort of behavior is reproducible, and to 
what extent, with one side of the graph split when using these two different 
buffer sizes.

TASK DETAIL
  https://phabricator.wikimedia.org/T359062


To: dr0ptp4kt
Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-04-01 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  The run with the buffer capacity at 1,000,000, heap size at 31g, and queue 
capacity at 4000 on the gaming-class desktop completed.
  
Processing wikidump-01332.ttl.gz
totalElapsed=13580ms, elapsed=13580ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=266483ms, commitTime=1711304860167, mutationCount=4772590
Sun Mar 24 01:27:45 PM CDT 2024

real    5690m30.371s
  
  ... which is 3.95 days. I'm trying again, but going back to the buffer 
capacity of 100,000 instead of 1,000,000 for one last comparison with these 
runs on this subset of munged data, and without any larger pause between 
batches of files. (Remember, the previous run with the buffer capacity at 
100,000, a 31g heap, and queue capacity at 4000 was done by first running files 
1-150 and then, after coming back to the terminal sometime later, resuming from 
file 151; but in the real world we usually hope to just let this thing run one 
file after another without any pause. In practice it could be that allowing the 
JVM time to heal itself created some artificial speed gains, but we'll see.)
  
  Starting on Friday, March 29, 2024 at 1:40 PM CT...
  
ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ time ./loadData.sh -n wdq -d /mnt/firehose/munge_on_later_data_set -s 1 -e 1332
Processing wikidump-00001.ttl.gz
  
  I'll update when it's done. It should presumably complete sometime in the 
next 24 hours.

TASK DETAIL
  https://phabricator.wikimedia.org/T359062


To: dr0ptp4kt
Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-21 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  **AWS EC2 servers**
  
  After exploring a battery of EC2 servers, four instance types were selected 
and the commands posted were run.
  
  The configuration most like our `wdqs1021-1023` servers (third generation 
Intel Xeon) is listed first. The fastest option among the four servers was a 
Graviton3 ARM-based configuration from Amazon.
  
  | Time Disk ➡️ Disk | Time RAMdisk ➡️ RAMdisk | Instance Type | Cost Per Hour | HD Transfer | Processor Comment | RAM Comment |
  | --- | --- | --- | --- | --- | --- | --- |
  | 26m46.651s | 26m26.923s | m6id <https://aws.amazon.com/ec2/instance-types/m6i/>.16xlarge | $3.7968 | EBS ➡️ NVMe | 64 vCPU @ "Up to 3.5 GHz 3rd Generation Intel Xeon Scalable processors (Ice Lake 8375C)" | 256 GB @ DDR4 |
  | 22m5.442s | 20m31.244s | m5zn <https://aws.amazon.com/ec2/instance-types/m5/>.metal | $3.9641 | EBS ➡️ EBS | 48 vCPU @ "2nd Generation Intel Xeon Scalable Processors (Cascade Lake 8252C) with an all-core turbo frequency up to 4.5 GHz" | 192 GiB @ DDR4 |
  | 21m40.537s | 20m57.268s | c5d <https://aws.amazon.com/ec2/instance-types/c5/>.12xlarge | $2.304 | EBS ➡️ NVMe | 48 vCPU @ "C5 and C5d 12xlarge, 24xlarge, and metal instance sizes feature custom 2nd generation Intel Xeon Scalable Processors (Cascade Lake 8275CL) with a sustained all core Turbo frequency of 3.6GHz and single core turbo frequency of up to 3.9GHz." | 96 GiB @ DDR4 |
  | 19m18.825s | 19m23.868s | c7gd <https://aws.amazon.com/ec2/instance-types/c7g/>.16xlarge | $2.903 | EBS ➡️ NVMe | 64 vCPU @ "Powered by custom-built AWS Graviton3 processors" | 128 GiB @ DDR5 |
  
  **2018 gaming desktop**
  
  Commands were then run against a gaming-class desktop from 2018. This 
outperformed the fastest Graviton3 configuration in AWS.
  
  The Blazegraph `bufferCapacity` configuration variable was also tested. Increasing 
the `bufferCapacity` from 10 to 100 yielded a sizable performance improvement.
  
  | Time Disk ➡️ Disk | Instance Type | bufferCapacity | HD Transfer | Processor Comment | RAM Comment |
  | --- | --- | --- | --- | --- | --- |
  | 18m31.647s | Alienware Aurora R7 <https://www.bestbuy.com/site/alienware-aurora-r7-gaming-desktop-intel-core-i7-8700-16gb-memory-nvidia-gtx-1070-1tb-hdd-intel-optane-memory/6155310.p?skuId=6155310> (upgraded) i7-8700 | 10 | SATA SSD ➡️ NVMe | 6 CPU @ up to 4.6 GHz (i7-8700 <https://ark.intel.com/content/www/us/en/ark/products/126686/intel-core-i7-8700-processor-12m-cache-up-to-4-60-ghz.html> page) | 64 GB @ DDR4 |
  | 18m3.798s | Alienware Aurora R7 <https://www.bestbuy.com/site/alienware-aurora-r7-gaming-desktop-intel-core-i7-8700-16gb-memory-nvidia-gtx-1070-1tb-hdd-intel-optane-memory/6155310.p?skuId=6155310> (upgraded) i7-8700 | 10 | NVMe ➡️ same NVMe | 6

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-21 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  By the way, I'm attempting a run for the first 1332 munged files (one shy of 
the 1333 where it terminated last time around) with the buffer at 10**0**, the heap 
size at 31g, and the queue capacity at 4000 on the gaming-class desktop, to see 
whether this imports smoothly and whether performance gains are noticeable.
  
ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ date
Wed Mar 20 02:36:59 PM CDT 2024
ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ time ./loadData.sh -n 
wdq -d /mnt/firehose/munge_on_later_data_set -s 1 -e 1332
  
  ...screen'ing in to check:
  
Processing wikidump-00505.ttl.gz
totalElapsed=13452ms, elapsed=13452ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=167329ms, commitTime=1711041930967, mutationCount=4566497
Thu Mar 21 12:25:35 PM CDT 2024
Processing wikidump-00506.ttl.gz
totalElapsed=15405ms, elapsed=15405ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=203202ms, commitTime=1711042135111, mutationCount=5262167
Thu Mar 21 12:28:58 PM CDT 2024
Processing wikidump-00507.ttl.gz
totalElapsed=14701ms, elapsed=14700ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=178754ms, commitTime=1711042314114, mutationCount=5005853
Thu Mar 21 12:31:57 PM CDT 2024

TASK DETAIL
  https://phabricator.wikimedia.org/T359062

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-20 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  The run to check with heap size of 31g, queue capacity of 8000, and buffer at 
10**0** stalled at file 107.

TASK DETAIL
  https://phabricator.wikimedia.org/T359062

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-20 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Attempting a run with a **queue capacity of 8000**, a buffer of 10**0**, 
and a heap size of 16g on the gaming-class desktop to mimic the MacBook Pro, 
things were slower than with a queue capacity of 4000, a buffer of 100, and a heap 
size of 31g on the same desktop 
<https://phabricator.wikimedia.org/T359062#9643972>.
  
ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ time ./loadData.sh -n 
wdq -d /mnt/firehose/munge_on_later_data_set -s 1 -e 150
...
real    280m46.264s
  
  A run is in progress to check whether there's any noticeable difference when the heap 
size is set to 31g but the queue capacity is at 8000 and the buffer is at 
10**0**, again processing the first 150 files on the gaming-class desktop.
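  
  For orientation, a rough sketch of where these three knobs live; the property name for the "queue capacity" is an assumption on my part (the B+Tree write retention queue), so treat this as illustrative rather than authoritative:
  
  # heap size is passed to the Blazegraph JVM via runBlazegraph.sh (HEAP_SIZE variable assumed)
  HEAP_SIZE=31g ./runBlazegraph.sh
  
  # RWStore.properties
  com.bigdata.rdf.sail.bufferCapacity=100                  # the "buffer" being varied
  com.bigdata.btree.writeRetentionQueue.capacity=4000      # assumed mapping for the "queue capacity"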

TASK DETAIL
  https://phabricator.wikimedia.org/T359062

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-19 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  **About Amazon Neptune**
  
  Amazon Neptune was set to import using the simpler N-Triples file format with 
its serverless configuration at 128 NCUs (about 256 GB of RAM with some 
attendant CPU). We don't use N-Triples files in our existing import process, 
but it is the sort of format used in the graph split imports.
  
curl -v -X POST \
-H 'Content-Type: application/json' \

https://db-neptune-1.cluster-cnim20k6c0mh.us-west-2.neptune.amazonaws.com:8182/loader
 -d '
{
  "source" : "s3://blazegraphdump/latest-lexemes.nt.bz2",
  "format" : "ntriples",
  "iamRoleArn" : "arn:aws:iam::ACCOUNTID:role/NeptuneLoadFromS3",
  "region" : "us-west-2",
  "failOnError" : "FALSE",
  "parallelism" : "HIGH",
  "updateSingleCardinalityProperties" : "FALSE",
  "queueRequest" : "TRUE"
}'
  
  This required a bunch of grants, and I had to make my personal bucket hosting 
the file listable and readable, as well as the objects listable and readable 
within it (it's possible to do chained IAM grants, but it is a bit of work and 
requires somewhat complicated STSes). It appeared that it was also necessary to 
create the VPC endpoint as described in the documentation.
  
  This was started at 1:30 PM CT on Monday, February 26, 2024. Note that this 
is the lexemes dump. I'm trying here to verify that with 128 NCUs it goes 
faster than with 32 NCUs; if it does, that will be useful for the 
bigger dump.
  
curl -v -X POST \
-H 'Content-Type: application/json' \

https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader
 -d '
{
  "source" : "s3://blazegraphdump/latest-lexemes.nt.bz2",
  "format" : "ntriples",
  "iamRoleArn" : "arn:aws:iam::ACCOUNTID:role/NeptuneLoadFromS3Attempt",
  "region" : "us-west-2",
  "failOnError" : "FALSE",
  "parallelism" : "OVERSUBSCRIBE",
  "updateSingleCardinalityProperties" : "FALSE",
  "queueRequest" : "TRUE"
}'


{
"status" : "200 OK",
"payload" : {
"loadId" : "8ace45ed-2989-4fd4-aa19-d13b9a59e824"
}

curl -G 
'https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader/8ace45ed-2989-4fd4-aa19-d13b9a59e824'


{
"status" : "200 OK",
"payload" : {
"feedCount" : [
{
"LOAD_COMPLETED" : 1
}
],
"overallStatus" : {
"fullUri" : "s3://blazegraphdump/latest-lexemes.nt.bz2",
"runNumber" : 1,
"retryNumber" : 0,
"status" : "LOAD_COMPLETED",
"totalTimeSpent" : 2142,
"startTime" : 1708975752,
"totalRecords" : 163715491,
"totalDuplicates" : 141148,
"parsingErrors" : 0,
"datatypeMismatchErrors" : 0,
"insertErrors" : 0
}
}
}
  
  Now, for the full Wikidata load. This was started at about 2:20 PM CT on 
Monday, February 26, 2024.
  
curl -v -X POST \
-H 'Content-Type: application/json' \

https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader
 -d '
{
  "source" : "s3://blazegraphdump/latest-all.nt.bz2",
  "format" : "ntriples",
  "iamRoleArn" : "arn:aws:iam::ACCOUNTID:role/NeptuneLoadFromS3Attempt",
  "region" : "us-west-2",
  "failOnError" : "FALSE",
  "parallelism" : "OVERSUBSCRIBE",
  "updateSingleCardinalityProperties" : "FALSE",
  "queueRequest" : "TRUE"
}'

{
"status" : "200 OK",
"payload" : {
"loadId" : "54dc9f5a-6e3c-428d-8897-180e10c96dbf"
}


curl -G 
'https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader/54dc9f5a-6e3c-428d-8897-180e10c96dbf'
  
  As a frame of reference, over 9B records imported in a bit over 26 hours. 
This is in the ballpark
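  
  As a rough rate (taking the ~9B records over ~26 hours at face value):
  
  echo $(( 9000000000 / (26 * 3600) ))   # ≈ 96,000 records ingested per second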

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-19 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  **Going for the full import**
  
  Further import commenced from there with a `bufferCapacity` of 10**0**:
  
ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ date
Mon Mar  4 06:31:06 PM CST 2024

ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ time ./loadData.sh -n 
wdq -d /mnt/firehose/munge_on_later_data_set -s 151 -e 2202
Processing wikidump-00151.ttl.gz
  
  Munge files 151 through 1333 were processed, stopping at Friday, March 8, 
2024 12:07:23 AM CST.
  
  So, we have about 4 hours for files 1-150, then another 77.6 hours for files 
151-1333. This means about 66% of the full dump was processed in about 3.5 days.
  
  As noted earlier, there may be an opportunity to set the queue capacity 
higher and squeeze out even better performance. This will need to wait until 
I'm physically at the gaming-class desktop.

TASK DETAIL
  https://phabricator.wikimedia.org/T359062

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-19 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  **More about bufferCapacity**
  
  Similarly, a run with 150 munged files was attempted with the buffer in 
RWStore.properties increased from 10 to 10**0**, with the NVMe as the 
target.
  
  com.bigdata.rdf.sail.bufferCapacity=100
  
ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ time ./loadData.sh -n 
wdq -d /mnt/firehose/munge_on_later_data_set -s 1 -e 150
...
real    240m5.344s
  
  Remember, for //nine// munged files the difference in performance for NVMe ➡️ 
same NVMe between a `bufferCapacity` of 10 versus 10**0** was 
about 34% (a ratio of ~1.34), and what we see here for //150// munged 
files is fairly consistent with that, at about 33%.
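  
  A quick check against the two 150-file NVMe ➡️ same NVMe runs reported in this thread (319m50.828s with the smaller buffer versus 240m5.344s here):
  
  echo "scale=3; (319 + 51/60) / (240 + 5/60)" | bc   # ≈ 1.33, i.e. roughly 33% slower with the smaller buffer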

TASK DETAIL
  https://phabricator.wikimedia.org/T359062

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-19 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  **More about NVMe versus SSD**
  
  Runs were also done to see the effects on 150 munged files (out of a set of 
2202 files) from the full Wikidata import, which allows for exercising more 
disk-related pieces. This was tried with both types of target disk - SATA SSD 
and M.2 NVMe - on the 2018 gaming desktop. This was done with a 
`bufferCapacity` of 10.
  
  The M.2 NVMe target was somewhere between 16% and 19% faster.
  
  Notice the following paths in the commands below:
  
  - `~/rdf`, which is part of a mount on the NVMe
  - `/mnt/t`, which is a copy of `~/rdf`, but on a SATA SSD
  - `/mnt/firehose/`, yet another SATA SSD, bearing the full set of munged files
  
  **Target is NVMe**
  
ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ time ./loadData.sh -n 
wdq -d /mnt/firehose/munge_on_later_data_set -s 1 -e 150

...

Processing wikidump-00150.ttl.gz
totalElapsed=33999ms, elapsed=33999ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=76005ms, commitTime=1709099819611, mutationCount=3098484

real    319m50.828s
  
  **Target is SATA SSD, run attempt 1**
  
  Now, the SATA SSD as the target (as before, the source has been a different 
SATA SSD).
  
ubuntu22:/mnt/t/rdf/dist/target/service-0.3.138-SNAPSHOT$ time 
./loadData.sh -n wdq -d /mnt/firehose/munge_on_later_data_set -s 1 -e 150

Processing wikidump-00150.ttl.gz
totalElapsed=45665ms, elapsed=45665ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=114606ms, commitTime=1709141576293, mutationCount=3098484

real    381m19.703s
  
  So, the SATA SSD as target yielded a result about 19% slower.
  
  **Target is SATA SSD, run attempt 2**
  
  The SATA SSD target was tried again from the same directory (as always, 
first stopping Blazegraph and deleting the journal), just to get a feeling for 
whether the first result was a fluke.
  
ubuntu22:/mnt/t/rdf/dist/target/service-0.3.138-SNAPSHOT$ time 
./loadData.sh -n wdq -d /mnt/firehose/munge_on_later_data_set -s 1 -e 150

totalElapsed=46490ms, elapsed=46490ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=120472ms, commitTime=1709169683880, mutationCount=3098484

real    373m52.079s

Still, some 16.5% slower on the SSD.
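
  Those percentages follow from the wall-clock ratios:

  echo "scale=3; (381 + 20/60) / (319 + 51/60)" | bc   # run 1: ≈ 1.192, about 19% slower
  echo "scale=3; (373 + 52/60) / (319 + 51/60)" | bc   # run 2: ≈ 1.17, in the same ballpark as the ~16.5% above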

TASK DETAIL
  https://phabricator.wikimedia.org/T359062

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-08 Thread dr0ptp4kt
dr0ptp4kt added a subscriber: ssingh.
dr0ptp4kt added a comment.


  @ssingh would you mind if the following command is run on one of the newer 
cp hosts with a new higher write throughput NVMe? If so, got a recommended 
node? I don't have access, but I think @bking may.
  
  `sudo sync; sudo dd if=/dev/zero of=tempfile bs=25M count=1024; sudo sync`
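  
  (For context, that writes a 25 GiB file of zeros in 25 MiB blocks as a rough sequential-write probe. If we want `dd`'s reported rate to include the final flush, a `conv=fdatasync` variant should also work: `sudo sync; sudo dd if=/dev/zero of=tempfile bs=25M count=1024 conv=fdatasync; sudo sync`.)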
  
  Heads up, I'm out for the rest of the day.

TASK DETAIL
  https://phabricator.wikimedia.org/T359062

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-08 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Thanks @bking ! It looks like the NVMe in this one is not a higher speed one 
for writes, and I'm also wondering if perhaps its write performance has 
degraded with age. I'll paste in the results here, but this was slower than the 
other servers, ironically (although not surprisingly because of the slower NVMe 
and slightly slower processor). This slower write speed is atypical of the 
other NVMes I've encountered. I believe the newer model ones are rated for 6000 
MB/s for writes. But, I'm going to ping on task to see if we can get a 
comparative read of disk throughput from one of the newer and faster cp 
NVMes.
  
dr0ptp4kt@wdqs1025:/srv/deployment/wdqs/wdqs-cache$ ls /srv/wdqs/
aliases.map  wikidata.jnl   wikidump-2.ttl.gz  
wikidump-4.ttl.gz  wikidump-6.ttl.gz  wikidump-8.ttl.gz
dumpswikidump-1.ttl.gz  wikidump-3.ttl.gz  
wikidump-5.ttl.gz  wikidump-7.ttl.gz  wikidump-9.ttl.gz
dr0ptp4kt@wdqs1025:/srv/deployment/wdqs/wdqs-cache$ cd cache
dr0ptp4kt@wdqs1025:/srv/deployment/wdqs/wdqs-cache/cache$ time 
./loadData.sh -n wdq -d /srv/wdqs -s 1 -e 9
Processing wikidump-1.ttl.gz
totalElapsed=214282ms, elapsed=214279ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=233942ms, commitTime=1709910647417, mutationCount=22829952
Processing wikidump-2.ttl.gz
totalElapsed=196470ms, elapsed=196469ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=227786ms, commitTime=1709910874952, mutationCount=15807617
Processing wikidump-3.ttl.gz
totalElapsed=183111ms, elapsed=183110ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=213965ms, commitTime=1709911089170, mutationCount=12654001
Processing wikidump-4.ttl.gz
^C

real    14m4.855s
user    0m0.084s
sys     0m0.053s
dr0ptp4kt@wdqs1025:/srv/deployment/wdqs/wdqs-cache/cache$ cd /srv
dr0ptp4kt@wdqs1025:/srv$ df .
Filesystem  1K-blocksUsed  Available Use% Mounted on
/dev/nvme0n1   1537157352 9508448 1449491832   1% /srv
dr0ptp4kt@wdqs1025:/srv$ sudo sync; sudo dd if=/dev/zero of=tempfile bs=25M 
count=1024; sudo sync
1024+0 records in
1024+0 records out
26843545600 bytes (27 GB, 25 GiB) copied, 27.1995 s, 987 MB/s
dr0ptp4kt@wdqs1025:/srv$ sudo sync; sudo dd if=/dev/zero of=tempfile bs=25M 
count=1024; sudo sync
1024+0 records in
1024+0 records out
26843545600 bytes (27 GB, 25 GiB) copied, 37.5448 s, 715 MB/s
dr0ptp4kt@wdqs1025:/srv$ lsblk -o MODEL,SERIAL,SIZE,STATE --nodeps
MODELSERIAL SIZE STATE
...
Dell Express Flash PM1725a 1.6TB SFF   S39XNX0JC01060   1.5T

TASK DETAIL
  https://phabricator.wikimedia.org/T359062

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-07 Thread dr0ptp4kt
dr0ptp4kt updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T359062

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, 
Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-07 Thread dr0ptp4kt
dr0ptp4kt updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T359062

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, 
Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-07 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  First, adding some commands that were used for Blazegraph imports on Ubuntu 
22.04. I had originally tried a good number of EC2 instance types, and then 
after that went back to focus on just four of them with a sequence of 
repeatable commands (this wasn't scripted, as I didn't want to spend time 
automating and also wanted to make sure I got the systems' feedback along the 
way). I forgot to grab RAM clock speed as a routine step when running these 
commands (I recall checking on one server maybe in the original checks, and did 
look at my Alienware), but generally servers are DDR4 unless the documentation 
in AWS says DDR5 (for my 2018 Alienware and 2019 MacBook Pro they're DDR4, BTW).
  
# get the specs, get the software, ready the mount
lscpu
free -h
lsblk
sudo fdisk /dev/nvme1n1
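# interactive fdisk responses follow: n (new partition), p (primary), 1, ENTER twice for default sectors, w (write)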
 n
 p
 1
 ENTER
 ENTER
 w
lsblk
sudo mkfs.ext4 /dev/nvme1n1p1
mkdir rdf
sudo mount -t auto -v /dev/nvme1n1p1 /home/ubuntu/rdf
sudo chown ubuntu:ubuntu rdf
git clone https://gerrit.wikimedia.org/r/wikidata/query/rdf rdfdownload
cp -r rdfdownload/. rdf
cd rdf
df -h .
sudo apt update
sudo apt install openjdk-8-jdk-headless
./mvnw package -DskipTests

# ready Blazegraph and run a partial import
sudo mkdir /var/log/wdqs
sudo chown ubuntu:ubuntu /var/log/wdqs
touch /var/log/wdqs/wdqs-blazegraph.log
cd /home/ubuntu/rdf/dist/target/
tar xzvf service-0.3.138-SNAPSHOT-dist.tar.gz
cd service-0.3.138-SNAPSHOT/
# using logback.xml like prod:
mv ~/logback.xml .
# using runBlazegraph.sh like prod, 31g heap and pointer to logback.xml:
mv ~/runBlazegraph.sh .
vi runBlazegraph.sh
screen
 ./runBlazegraph.sh
# CTRL-a-d to leave screen up
time ./loadData.sh -n wdq -d /home/ubuntu/ -s 1 -e 9
screen -r
# CTRL-c to kill Blazegraph
 exit  # from screen
ls -alh wikidata.jnl
rm wikidata.jnl

# try it with a ramdisk
sudo modprobe brd rd_size=50331648 max_part=1 rd_nr=1
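# (rd_size is given in KiB, so rd_size=50331648 ≈ 48 GiB of RAM-backed block device)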
sudo mkfs -t ext4 /dev/ram0
mkdir /home/ubuntu/rdfram
sudo mount /dev/ram0 /home/ubuntu/rdfram
sudo chown ubuntu:ubuntu /home/ubuntu/rdfram
cd
cp -r rdf/. rdfram
cd rdfram/dist/target/service-0.3.138-SNAPSHOT/
cp /home/ubuntu/wikidump-* /home/ubuntu/rdfram
df -h ./
screen
 ./runBlazegraph.sh
# CTRL-a-d to leave screen up
time ./loadData.sh -n wdq -d /home/ubuntu/rdfram -s 1 -e 9

TASK DETAIL
  https://phabricator.wikimedia.org/T359062

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, 
Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-06 Thread dr0ptp4kt
dr0ptp4kt updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T359062

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, 
Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T358727: Reclaim recently-decommed CP host for WDQS (see T352253)

2024-03-05 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  @VRiley-WMF any pointers on how to reach this node via iDRAC / iLO and set it 
up with a hostname of `wdqs1025.eqiad.wmnet`? I'm wondering if maybe there's a 
direct IP or IPs, given that there don't seem to be DNS records for 
`cp1086.eqiad.wmnet` or `cp1086.mgmt.eqiad.wmnet`.

TASK DETAIL
  https://phabricator.wikimedia.org/T358727

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: VRiley-WMF, dr0ptp4kt
Cc: Jclark-ctr, VRiley-WMF, ssingh, RKemper, dr0ptp4kt, wiki_willy, bking, 
Wunderlandmeli, Danny_Benjafield_WMDE, Astuthiodit_1, BTullis, karapayneWMDE, 
joanna_borun, Invadibot, Devnull, maantietaja, Muchiri124, ItamarWMDE, 
Akuckartz, Legado_Shulgin, ReaperDawn, Nandana, Davinaclare77, Techguru.pc, 
Lahi, Gq86, GoranSMilovanovic, Hfbn0, QZanden, EBjune, KimKelting, LawExplorer, 
Zppix, _jensen, rosalieper, Scott_WUaS, Wong128hk, Wikidata-bugs, aude, faidon, 
Mbch331, Jay8g, fgiunchedi
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-04 Thread dr0ptp4kt
dr0ptp4kt updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T359062

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, 
Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-04 Thread dr0ptp4kt
dr0ptp4kt moved this task from Incoming to Current work on the 
Wikidata-Query-Service board.
dr0ptp4kt removed a project: Wikidata-Query-Service.

TASK DETAIL
  https://phabricator.wikimedia.org/T359062

WORKBOARD
  https://phabricator.wikimedia.org/project/board/891/

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, 
Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331, AWesterinen, Namenlos314, 
Lucas_Werkmeister_WMDE, merbst, Jonas, Xmlizer, jkroll, Jdouglas, Tobias1984, 
Manybubbles
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-04 Thread dr0ptp4kt
dr0ptp4kt updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T359062

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, 
Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, 
EBjune, KimKelting, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, 
Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, 
Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-04 Thread dr0ptp4kt
dr0ptp4kt updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T359062

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, 
Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, 
EBjune, KimKelting, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, 
Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, 
Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-04 Thread dr0ptp4kt
dr0ptp4kt changed the task status from "Open" to "In Progress".
dr0ptp4kt triaged this task as "Medium" priority.
dr0ptp4kt claimed this task.
dr0ptp4kt added projects: Wikidata-Query-Service, Discovery-Search (Current 
work).
dr0ptp4kt updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T359062

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, Aklapper, AWesterinen, Namenlos314, Gq86, 
Lucas_Werkmeister_WMDE, EBjune, KimKelting, merbst, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T358727: Reclaim recently-decommed CP host for WDQS (see T352253)

2024-03-01 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Thanks @VRiley-WMF ! @bking is up next for imaging, I think.

TASK DETAIL
  https://phabricator.wikimedia.org/T358727

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: VRiley-WMF, dr0ptp4kt
Cc: Jclark-ctr, VRiley-WMF, ssingh, RKemper, dr0ptp4kt, wiki_willy, bking, 
Wunderlandmeli, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, BTullis, 
karapayneWMDE, joanna_borun, Invadibot, Devnull, maantietaja, Muchiri124, 
ItamarWMDE, Akuckartz, Legado_Shulgin, ReaperDawn, Nandana, Namenlos314, 
Davinaclare77, Techguru.pc, Lahi, Gq86, Lucas_Werkmeister_WMDE, 
GoranSMilovanovic, Hfbn0, QZanden, EBjune, KimKelting, merbst, LawExplorer, 
Zppix, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, Wong128hk, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, faidon, Mbch331, Jay8g, 
fgiunchedi
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T358727: Reclaim recently-decommed CP host for WDQS (see T352253)

2024-02-29 Thread dr0ptp4kt
dr0ptp4kt added a parent task: T358533: Hardware requests for Search Platform 
FY2024-2025.

TASK DETAIL
  https://phabricator.wikimedia.org/T358727

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: Jclark-ctr, VRiley-WMF, ssingh, RKemper, dr0ptp4kt, wiki_willy, bking, 
Wunderlandmeli, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, BTullis, 
karapayneWMDE, joanna_borun, Invadibot, Devnull, maantietaja, Muchiri124, 
ItamarWMDE, Akuckartz, Legado_Shulgin, ReaperDawn, Nandana, Namenlos314, 
Davinaclare77, Techguru.pc, Lahi, Gq86, Lucas_Werkmeister_WMDE, 
GoranSMilovanovic, Hfbn0, QZanden, EBjune, KimKelting, merbst, LawExplorer, 
Zppix, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, Wong128hk, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, faidon, Mbch331, Jay8g, 
fgiunchedi
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T358727: Reclaim recently-decommed CP host for WDQS (see T352253)

2024-02-29 Thread dr0ptp4kt
dr0ptp4kt added a parent task: T336443: Investigate performance differences 
between wdqs2022 and older hosts.

TASK DETAIL
  https://phabricator.wikimedia.org/T358727

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: Jclark-ctr, VRiley-WMF, ssingh, RKemper, dr0ptp4kt, wiki_willy, bking, 
Wunderlandmeli, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, BTullis, 
karapayneWMDE, joanna_borun, Invadibot, Devnull, maantietaja, Muchiri124, 
ItamarWMDE, Akuckartz, Legado_Shulgin, ReaperDawn, Nandana, Namenlos314, 
Davinaclare77, Techguru.pc, Lahi, Gq86, Lucas_Werkmeister_WMDE, 
GoranSMilovanovic, Hfbn0, QZanden, EBjune, KimKelting, merbst, LawExplorer, 
Zppix, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, Wong128hk, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, faidon, Mbch331, Jay8g, 
fgiunchedi
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-02-05 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  I summarized at 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Graph_split_IGUANA_performance
 . When we have a mailing list post during the next week or so, we'll want to 
move this to be a subpage of the target page of the post.

TASK DETAIL
  https://phabricator.wikimedia.org/T355037

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-02-02 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  In T355037#9508760 <https://phabricator.wikimedia.org/T355037#9508760>, 
@dcausse wrote:
  
  > @dr0ptp4kt thanks! is the difference in the number of successful queries 
only explained by the improvement in query time or are there some improvements 
in the number of queries that timeout as well?
  
  Good question! It appears to be related to query time.
  
  Looking at this latest run, for example, there were no recorded timeouts 
according to the CSV generated from the IGUANA `.nt` output.
  
  Taking things on a head-to-head basis for identical queries between the 
endpoints, here's what we see for the difference in speed for 
`wikidata_main_graph` minus `baseline`. It's unsurprising in a way given the 
distribution shown in the prior Phabricator comment, but it is another way of 
seeing that, under the parameters of this test anyway, about 70% of the 
queries noted as successful were faster when run against the 
`wikidata_main_graph`. Note that about 16% of the queries hit `wrongCodes` / 
`failed`, which are discussed after the table.
  
  | Per-query wikidata_main_graph QPS minus baseline QPS | descriptor |
  | --- | --- |
  | 0.722596509877809 | average |
  | 0.244672300065055 | median |
  | 79.4339558877256 | 100% max (i.e., wikidata_main_graph's biggest winner) |
  | 21.0654641024791 | 99% |
  | 6.88080533343067 | 95% |
  | 1.38414473312972 | 75% |
  | 0.244672300065055 | 50% |
  | 0.013982881368447 | 42% |
  | 0 | 41% |
  | 0 | 26% |
  | -0.00701117502390231 | 25% |
  | -0.215374628998983 | 20% |
  | -0.598658931613195 | 15% |
  | -1.41867399989265 | 10% |
  | -4.16152316076897 | 5% |
  | -18.0068429593504 | 1% |
  | -80.2800161266253 | 0% min (i.e., baseline's biggest winner) |
  
  About 58% of queries tilted toward `wikidata_main_graph`, about 25% 
tilted toward `baseline`, and 58/(58+25) is about 0.7. The queries where the 
difference is negligible probably don't matter that much. Yet, there's a bit 
more detail to consider in IGUANA's conception here...
  
  For the sake of completeness, and because this may be interesting to consider 
later on or to contextualize the QPS distributions in the prior Phabricator 
comment: looking at a different class of issues, let's suppose that we use 
`wrongCodes` as a proxy for things that could have gone wrong. `wrongCodes` and 
`failed` map to each other in the CSV, and their QPSes land as 0 for these 
records (`penalizedQPS`, not included in the tables above, lands by default as 
0.017 for these records, but this is close enough to 0 if we wanted to 
look at it that way). These sorts of records thus drive down the summary mean, 
median, and so on. As an aside, in terms of actual time (`totalTime`), these 
`wrongCodes` records occupy very little time.
  
  | Endpoint Label | count wrongCodes | sum wrongCodes | count failed | sum failed | count timeout | count QPS < 1.0 | count QPS < 5.0 | count QPS < 20.0 | count QPS < 80.0 | count QPS < 200.0 |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | baseline |

[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-02-01 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Here's the output from the latest run based upon a larger set of queries from 
a random sample of WDQS queries.
  
$ /usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -cp iguana-3.3.3.jar 
org.aksw.iguana.rp.analysis.TabularTransform -e result.nt > result.execution.csv
$ cut -f1,3,5,6,7,9 -d"," result.execution.csv | sed 's/,/|/g'
  
  | endpointLabel | taskStartDate | successfullQueries | successfullQueriesQPH | avgqps | queryMixesPH |
  | --- | --- | --- | --- | --- | --- |
  | baseline | 2024-01-31T23:20:44.567Z | 319857 | 136612.71246575614 | 18.83670491311007 | 1.732300885924224 |
  | wikidata_main_graph | 2024-02-01T04:23:01.613Z | 331473 | 147674.12233239523 | 19.55930142298825 | 1.8725637484770261 |
  
  Here's the screen capture from Grafana.
  
  F41740308: Screenshot 2024-02-01 at 10.17.28 AM.png 
<https://phabricator.wikimedia.org/F41740308>
  
  The `wikidata_main_graph` window completed more queries despite an apparent 
bout of increased failing queries (climb began at about 0915 UTC), with a large 
garbage collection beginning about 5 minutes later (GC started at about 0920 
UTC; the GC actually continued well after the `wikidata_main_graph`'s window 
closure at 2024-02-01T09:23:55.639Z). This isn't the most interesting thing as 
it only constitutes about 1.5%-3.0% of the `wikidata_main_graph` window 
depending on how one looks at it, and I wouldn't necessarily read anything into 
whether such GCs would be likely to occur under the same conditions, but I 
wanted to note it nonetheless.
  
  To repeat the verbiage from the earlier runs...
  
  > Following below are "per-query" summary stats. I actually just put this 
together by bringing CSV data into Google Sheets for now - all of the columns 
are calculated upon the "per-query" rows (but you'll see how the Mean 
corresponds basically with the value calculated up above). The underlying CSV 
data don't bear actual queries (the .nt files from which they're generated do), 
...
  
  The CSV data were generated with the following command:
  `/usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -cp iguana-3.3.3.jar 
org.aksw.iguana.rp.analysis.TabularTransform -q result.nt > result.query.csv`
  
  | Run | Endpoint Label | Mean | Median | Standard Deviation | Max (fastest) | 99% (very fast) | 0.95 | 0.75 | 0.5 | 0.25 | 1% (pretty slow) | Total w/ success |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | randomized 1 | baseline | 18.8367049131101 | 14.6999663404689 | 16.3589173757083 | 127.433177227691 | 59.009472115968 | 50.5734395961334 | 30.3470335487675 | 14.6999663404689 | 4.97164300568995 | 0 | 319857 |
  | randomized 1 | wikidata_main_graph | 19.5593014229883 | 16.0982853987134 | 16.5098295290687 | 121.141149629509 | 58.9613256488317 | 51.0426872548935 | 31.751311031492 | 16.0982853987134 | 5.37249826361878 | 0 | 331473 |
  
  Although the max and 99th percentile queries were just ever so slightly 
faster on the baseline "full" graph, more generally things were faster on the 
non-scholarly "main" graph. The performance difference is obvious but not 
dramatic.
  
  Here's the content of `wdqs-split-test-randomized-2024-01-31.yml`, comments 
removed for brevity. The main difference in this configuration file from the 
earlier presented one is five hours allowed per graph, to accommodate a larger 
query mix, and the updated filename pointing to the larger query mix based on 
the set of queries from the random sample.
  
datasets:
  - name: "split"
connections:
  - name: "baseline"
endpoint: "https://wdqs1022.eqiad.wmnet/sparql"
  - name: "wikidata_main_graph"
endpoint: "https://wdqs1024.eqiad.wmnet/sparql"

tasks:
  - className: "org.aksw.iguana.cc.tasks.impl.Stresstest"
configuration:
  timeLimit: 1800
  warmup:
timeLimit: 3
workers:
  - threads: 4
className: "SPARQLWorker"
queriesFile: 
"queries_for_performance_file_renamed_randomized_2024_01_31.txt"
timeOut: 5000
  queryHandler:
className: "DelimInstancesQueryHandler"
configuration:
  delim: "### BENCH DELIMITER ###"
  workers:
- threads: 4
  className: "SPARQLWorke

[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-31 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  A run is in progress for 78K+ queries from a set of 100,000 random queries. 
It should be done in under 10 hours from now.
  
scala> val full_random = 
spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/full_random_classified.parquet")

scala> val wikidata_random = 
spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/wikidata_random_classified.parquet")

scala> full_random.count
res0: Long = 10 

scala> wikidata_random.count
res6: Long = 10  

scala> val joined11 = 
wikidata_random.as("w").join(full_random.as("f")).where("w.id = f.id and 
w.success = true and  w.success = f.success and w.resultSize = f.resultSize and 
w.reorderedHash = f.reorderedHash").select(concat(col("w.query"), lit("\n### 
BENCH DELIMITER ###"))).distinct.sample(withReplacement=false, fraction=1.0, 
seed=42)

scala> joined11.count
res0: Long = 78862

scala> joined11.repartition(1).write.option("compression", 
"none").text("queries_for_performance_2024_01_31.txt")

scala> :quit

$ hdfs dfs -copyToLocal 
hdfs://analytics-hadoop/user/dr0ptp4kt/queries_for_performance_2024_01_31.txt/part-0-29c4e72d-800d-4148-b804-8e428ee71e9e-c000.txt
 ./queries_for_performance_file_renamed_randomized_2024_01_31.txt

$ bash start-iguana.sh wdqs-split-test-randomized-2024-01-31.yml
  
  `start-iguana.sh` previously ran from `stat1006`, but this time around it's 
running from `stat1008` in order to use more RAM for the larger query mix.

TASK DETAIL
  https://phabricator.wikimedia.org/T355037

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-30 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Following below are "per-query" summary stats. I actually just put this 
together by bringing CSV data into Google Sheets for now - all of the columns 
are calculated upon the "per-query" rows (but you'll see how the Mean 
corresponds with the value calculated up above, with just slightly less 
precision). The underlying CSV data don't bear actual queries (the `.nt` files 
from which they're generated do), but rather rows of this form:
  

endpointLabel,task,queryId,totalTime,success,failed,timeouts,resultSize,unknownException,wrongCodes,qps,penalizedQPS

baseline,http://iguana-benchmark.eu/resource/1706221131/1/1,http://iguana-benchmark.eu/resource/1989023647/sparql0,53.592,2,0,0,1,0,0,37.319002836244216,37.319002836244216
  
  No big surprises here. The "per-query" behavior was similar between nodes. 
The "main" graph skewed somewhat faster over the full range of queries, with one 
exception: the absolute fastest singular query for the "randomized 1" run was 
slightly faster on the "baseline" full graph. Generally, everything else 
skewed faster for the "main" graph.
  
  **Per-query theoretical throughput (queries per second for given query)**
  
  | Run | Endpoint Label | Mean | Median | Standard Deviation | Max (fastest) | 99% (very fast) | 0.95 | 0.75 | 0.5 | 0.25 | 1% (pretty slow) | Total w/ success |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | non-randomized 1 | baseline | 32.6059031135735 | 34.3489164537969 | 19.4337414434464 | 120.235661897318 | 76.8554161439261 | 55.7841155056668 | 49.1222469015747 | 34.3489164537969 | 14.3159887843845 | 0.0564663314012815 | 15538 |
  | non-randomized 1 | wikidata_main_graph | 33.8619129716351 | 35.7193884840691 | 20.2056740376445 | 148.610491900728 | 81.2789060283893 | 57.5619081064873 | 50.4922999242615 | 35.7193884840691 | 15.098897780462 | 0.0625188498260728 | 16773 |
  | non-randomized 2 | baseline | 32.9728451327437 | 34.7318699638788 | 19.5908672232246 | 128.890893858348 | 74.7142465127726 | 56.1419267909274 | 49.7404172035179 | 34.7318699638788 | 14.6689672498449 | 0.0569581938930891 | 15893 |
  | non-randomized 2 | wikidata_main_graph | 34.0852093005914 | 36.106296938186 | 20.1931723422722 | 130.25921583952 | 82.0449565998977 | 57.6139754652666 | 50.625221485344 | 36.106296938186 | 15.378937007874 | 0.0622306059862422 | 16780 |
  | randomized 1 | baseline | 32.8878633004489 | 34.6404323125952 | 19.8757923913207 | 136.072935093209 | 79.2782608462366 | 56.2164107372478 | 49.4926998267755 | 34.6404323125952 | 14.1755500113404 | 0.0557895216707227 | 15180 |
  | randomized 1 | wikidata_main_graph | 33.9156003706814 | 35.7091844022282 | 20.2748013579631 | 132.082948091401 | 81.0501654381498 | 57.747392312958 | 50.5101525406606 | 35.7091844022282 | 15.079658294943 | 0.0574330048487202 | 15929 |
  | randomized 2 | baseline | 33.007109052298 | 34.5670904661201 | 19.8511760316909 | 133.904659882163 | 81.4017649973028 | 56.1335754963953 | 49.6176341128876 | 34.5670904661201 | 14.3154187934342 | 0.0538090222917457 | 15211 |
  | randomized 2 | wikidata_main_graph | 34.1402036271541 | 36.0958706323996 | 20.2577595936201 | 134.156157767641 | 83.0239310934627 | 57.7850982363029 | 50.5292946486809 | 36.0958706323996 | 15.7122834258313 | 0.0589775584599512 | 16084 |

TASK DETAIL
  https://phabricator.wikimedia.org/T355037

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-27 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Here are the data produced by IGUANA once piped through the CSV utility 
introduced in 
https://gitlab.wikimedia.org/repos/search-platform/IGUANA/-/merge_requests/3/diffs
 with a command of the following form (for the attentive reader, note that I 
had to rename the originally produced files to have an `.nt` extension so that 
the underlying Jena libraries would not throw an exception).
  
  `/usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -cp iguana-3.3.3.jar 
org.aksw.iguana.rp.analysis.TabularTransform -e result.003.nt > 
result.003.execution.csv`
  
  | run | endpointLabel | taskStartDate | successfullQueries | successfullQueriesQPH | avgqps | queryMixesPH |
  | --- | --- | --- | --- | --- | --- | --- |
  | non-randomized 1 | baseline | 2024-01-25T22:18:57.753Z | 15538 | 17512.446990539123 | 32.60590311357346 | 0.9895715087607575 |
  | non-randomized 1 | wikidata_main_graph | 2024-01-25T23:19:56.948Z | 16773 | 19125.484555828807 | 33.86191297163505 | 1.0807190233276154 |
  | non-randomized 2 | baseline | 2024-01-26T01:47:41.634Z | 15893 | 17955.609618256018 | 32.97284513274341 | 1.0146131897076351 |
  | non-randomized 2 | wikidata_main_graph | 2024-01-26T02:48:41.047Z | 16780 | 19145.810254441058 | 34.085209300591515 | 1.0818675625496446 |
  | randomized 1 | baseline | 2024-01-26T16:51:54.091Z | 15180 | 17068.107622599186 | 32.88786330044905 | 0.9644633340452725 |
  | randomized 1 | wikidata_main_graph | 2024-01-26T17:52:52.903Z | 15929 | 17969.809300477013 | 33.91560037068121 | 1.0154155676372838 |
  | randomized 2 | baseline | 2024-01-26T19:37:30.811Z | 15211 | 17054.882354485933 | 33.00710905229813 | 0.9637160170924978 |
  | randomized 2 | wikidata_main_graph | 2024-01-26T20:38:29.989Z | 16084 | 18210.142239149543 | 34.14020362715409 | 1.0289960015341326 |
  
  Keep in mind that a delay between queries was introduced in the configuration for 
these "stress tests" (a "stress test" here means that execution of the 
queries goes on continuously for the specified time interval at the configured concurrency 
and delay). This was to more closely model what a somewhat busy, but not 
completely saturated, WDQS node might experience, although we should be mindful 
that the server specs differ a bit between these test servers and the 
WDQS hosts used for serving end-user production requests. When 
interpreting a value like `avgqps`, remember that it is akin to what might 
happen if queries were executed serially without delay, if it were possible to 
hold JVM performance constant under such request patterns (that is generally 
not possible to guarantee, so caveats abound; in other words it's 
entirely possible that `avgqps` could degrade in reality).
  
  The `successfullQueriesQPH` metric is probably the most interesting one. It's 
suggestive of about a 5%-10% speed advantage for the smaller "main" graph 
versus a fully populated "full" graph for this query mix when conditions model 
a somewhat busy WDQS node (again, remember that the server spec differs a bit 
between the SUT and production nodes, so there is a caveat). Additional basic 
summary statistics from per-query CSV exports (using the 
`-q` flag) against the `.nt` files are to come.
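  
  As a quick check, that range follows from the successfullQueriesQPH pairs in the table above, e.g.:
  
  echo "scale=3; 19125.48 / 17512.45" | bc   # non-randomized 1: ≈ 1.092, about a 9% advantage
  echo "scale=3; 17969.81 / 17068.11" | bc   # randomized 1: ≈ 1.05, about a 5% advantage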
  
  Note that in Andrea's previous analysis these sorts of statistics (as well as 
some tweaks to get somewhat finer precision via `BigDecimal` instead of 
`Double` types) were incorporated directly into the Java source of IGUANA - see 
https://github.com/dice-group/IGUANA/compare/main...AndreaWesterinen:IGUANA:main
 for changes up to June 13, 2022 against current main branch of IGUANA; n.b., 
to future readers you may need to re-correlate the code changes when IGUANA 
upstream changes. But, I opted to make fewer changes to our fork (i.e., I 
didn't merge Andrea's fork into our fork, even if there is some dependency 
similarity in the POMs) as this data can be determined in Spark summary stat 
calls. We may be interested in how to take forward some of the enhancement 
opportunities for IGUANA upstream should we see the need for more IGUANA work 
later, but then again we may not do that as our needs are narrower.

TASK DETAIL
  https://phabricator.wikimedia.org/T355037

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maanti

[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-27 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Now a screenshot from the re-run of the randomized order queries, followed by 
a screenshot showing the two runs on the randomized order queries side by side.
  
  F41722569: Screenshot 2024-01-27 at 6.36.58 AM.png 
<https://phabricator.wikimedia.org/F41722569>
  
  F41722573: Screenshot 2024-01-27 at 6.38.45 AM.png 
<https://phabricator.wikimedia.org/F41722573>

TASK DETAIL
  https://phabricator.wikimedia.org/T355037

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-26 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Now, the screenshot from the randomized order queries. I'll run one more time 
to see that comparable output is achieved. Those were produced with the 
following. This latest output file has been moved to `result.nt.003`.
  
scala> val joined6 = wikidata.as("w").join(full.as("f")).where("w.id = f.id 
and w.success = true and  w.success = f.success and w.resultSize = f.resultSize 
and w.reorderedHash = f.reorderedHash").select(concat(col("w.query"), 
lit("\n### BENCH DELIMITER ###"))).distinct.sample(withReplacement=false, 
fraction=1.0, seed=42)
scala> joined6.count // matches same as joined5.count
scala> joined6.repartition(1).write.option("compression", 
"none").text("queries_for_performance_randomized_2024_01_26.txt")
scala> :quit
$ hdfs dfs -copyToLocal 
hdfs://analytics-hadoop/user/dr0ptp4kt/queries_for_performance_randomized_2024_01_26.txt/part-0-131df78f-da7a-4ffc-aad4-9874342165ca-c000.txt
 ./queries_for_performance_randomized.txt 
$ sha1sum queries_for_performance.txt queries_for_performance_randomized.txt
$ # they're different
$ diff queries_for_performance.txt queries_for_performance_randomized.txt | 
wc -l
$ # they're very different
$ cp wdqs-split-test.yml wdqs-split-test-randomized.yml
$ # changed pointers to query file to be 
queries_for_performance_randomized.txt
$ bash start-iguana.sh wdqs-split-test-randomized.yml
$ mv result.nt result.nt.003

TASK DETAIL
  https://phabricator.wikimedia.org/T355037

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-26 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Now, a screenshot showing the re-run. And then a screenshot showing them 
side-by-side. This is just for the visual, and the data produced from IGUANA 
(what is in the `.nt` output that we can convert to a handy CSV) should be more 
telling.
  
  Next up, I'll randomize the order of the queries and do it again.
  
  F41720004: Screenshot 2024-01-26 at 10.19.36 AM.png 
<https://phabricator.wikimedia.org/F41720004>
  
  F41720006: Screenshot 2024-01-26 at 10.20.48 AM.png 
<https://phabricator.wikimedia.org/F41720006>
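
  As a rough sketch of that `.nt`-to-CSV step (spark-shell; this is not the official IGUANA tooling, and the regex is only a quick N-Triples line parse), one could pull out the triples whose predicates carry the metrics named in the test config (QPS, AvgQPS, QMPH, NoQPH, NoQ) and write them out as CSV:

// rough sketch only: parse result.nt lines into (subject, predicate, object)
val lines = spark.read.textFile("result.nt")
val triplePattern = """^(<[^>]+>|_:\S+)\s+<([^>]+)>\s+(.+)\s+\.$""".r
val triples = lines.flatMap {
  case triplePattern(s, p, o) => Some((s, p, o))
  case _ => None   // skip anything the quick regex doesn't handle
}.toDF("subject", "predicate", "object")
// keep the metric triples configured for the run (AvgQPS ends in QPS, so it matches too)
val metrics = triples.filter($"predicate".rlike("(QPS|QMPH|NoQPH|NoQ)$"))
metrics.repartition(1).write.option("header", "true").csv("iguana_metrics_csv")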

TASK DETAIL
  https://phabricator.wikimedia.org/T355037

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-25 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Dropping in a screenshot from Grafana from this first pass; I also made a copy of `result.nt` to `result.nt.001`. Re-running to see that server behavior is similar.
  
  F41718197: Screenshot 2024-01-25 at 7.43.14 PM.png 
<https://phabricator.wikimedia.org/F41718197>

TASK DETAIL
  https://phabricator.wikimedia.org/T355037

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-25 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  For the first pass, the following configuration is being used for an hour-long test conducted from `stat1006`; the config file is `wdqs-split-test.yml`.
  
datasets:
  - name: "split"
connections:
  - name: "baseline"
endpoint: "https://wdqs1022.eqiad.wmnet/sparql;
  - name: "wikidata_main_graph"
endpoint: "https://wdqs1024.eqiad.wmnet/sparql;

tasks:
  - className: "org.aksw.iguana.cc.tasks.impl.Stresstest"
configuration:
  timeLimit: 360
  warmup:
timeLimit: 3
workers:
  - threads: 4
className: "SPARQLWorker"
queriesFile: "queries_for_performance.txt"
timeOut: 5000
  queryHandler:
className: "DelimInstancesQueryHandler"
configuration:
  delim: "### BENCH DELIMITER ###"
  workers:
- threads: 4
  className: "SPARQLWorker"
  queriesFile: "queries_for_performance.txt"
  timeOut: 6
  parameterName: "query"
  gaussianLatency: 100

metrics:
  - className: "QMPH"
  - className: "QPS"
  - className: "NoQPH"
  - className: "AvgQPS"
  - className: "NoQ"

storages:
  - className: "NTFileStorage"
configuration:
  fileName: result.nt
  
  `queries_for_performance.txt` is based on the following basic code, which selects queries known to work against both the full graph and the main (non-scholarly) graph and to return similar results, so as to reduce garbage input and somewhat better control the parameters of the test.
  
scala> val wikidata = 
spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/wikidata_classified.parquet")
scala> val full = 
spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/full_classified.parquet")
scala> val joined5 = wikidata.as("w").join(full.as("f")).where("w.id = f.id 
and w.success = true and  w.success = f.success and w.resultSize = f.resultSize 
and w.reorderedHash = f.reorderedHash").select(concat(col("w.query"), 
lit("\n### BENCH DELIMITER ###"))).distinct
scala> joined5.repartition(1).write.option("compression", 
"none").text("queries_for_performance_2024_01_25.txt")
scala> :quit

$ hdfs dfs -copyToLocal 
hdfs://analytics-hadoop/user/dr0ptp4kt/queries_for_performance_2024_01_25.txt/part-0-6b8caed3-3a4d-4cb2-bf74-6bbcd7af0478-c000.txt
 ./queries_for_performance.txt
$ /usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -jar iguana-3.3.3.jar 
wdqs-split-test.yml
  
  The IGUANA build is based on 
https://gitlab.wikimedia.org/repos/search-platform/IGUANA/-/merge_requests/4 .

TASK DETAIL
  https://phabricator.wikimedia.org/T355037

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-25 Thread dr0ptp4kt
dr0ptp4kt claimed this task.

TASK DETAIL
  https://phabricator.wikimedia.org/T355037

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-25 Thread dr0ptp4kt
dr0ptp4kt updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T355037

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T350106: Implement a spark job that converts a RDF triples table into a RDF file format

2024-01-04 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Imports seemed to work.
  
  **Non-scholarly article side (proxied to wdqs1024.eqiad.wmnet)**
  F41650681: split-non-schol-side.gif 
<https://phabricator.wikimedia.org/F41650681>
  
  **Scholarly article side (proxied to wdqs1023.eqiad.wmnet)**
  F41650680: split-schol-side.gif <https://phabricator.wikimedia.org/F41650680>
  
  Next steps:
  
  - Add automated unit test(s) to the patch.
  - Add doc / pointer to Pastes somewhere handy
  
  Also, non-blocking for this task, but mentioning it here for findability - the queries in T349512: [Analytics] Collect multiple sets of SPARQL queries 
<https://phabricator.wikimedia.org/T349512> will provide a fuller view of query coverage and runtime characteristics.

TASK DETAIL
  https://phabricator.wikimedia.org/T350106

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: Gehel, RKemper, EBernhardson, Aklapper, BTullis, bking, dr0ptp4kt, 
JAllemandou, dcausse, Danny_Benjafield_WMDE, Isabelladantes1983, Themindcoder, 
Adamm71, Jersione, Hellket777, LisafBia6531, Astuthiodit_1, AWesterinen, 786, 
Biggs657, karapayneWMDE, Invadibot, maantietaja, Juan90264, Alter-paule, 
Beast1978, ItamarWMDE, Un1tY, Akuckartz, Hook696, Kent7301, joker88john, 
CucyNoiD, Nandana, Namenlos314, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, 
Af420, Bsandipan, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, 
merbst, LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Neuronton, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T350106: Implement a spark job that converts a RDF triples table into a RDF file format

2023-12-05 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  After an update to the script (PS6) and a fresh run of the same commands, new files have been `hdfs-rsync`'d to `stat1006:~dr0ptp4kt/gzips` in anticipation of a file transfer over to the WDQS graph split test servers.
  
  Here's a very small sample of what the files look like:
  
$ zcat part-01022-c261bb68-4091-4613-ae52-88ce97d22c14-c000.txt.gz | tail 
-10
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/description> 
"\u0935\u093F\u0915\u093F\u092E\u093F\u0921\u093F\u092F\u093E 
\u0936\u094D\u0930\u0947\u0923\u0940"@ne .
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/description> 
"\u043A\u0430\u0442\u0435\u0433\u043E\u0440\u0438\u0458\u0430 \u043D\u0430 
\u0412\u0438\u043A\u0438\u043C\u0435\u0434\u0438\u0458\u0438"@sr .
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/description> 
"\u7DAD\u57FA\u5A92\u9AD4\u5206\u985E"@yue .
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/description> 
"Wikimedia-Kategorie"@de-ch .
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/description> 
"catigur\u00ECa di nu pruggettu Wikimedia"@scn .
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/description> 
"categoria di un progetto Wikimedia"@it .
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/version> 
"1979010859"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/description> 
"kategori Wikimedia"@map-bms .
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/description> 
"Wikimedia-kategoriija"@se .
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/description> 
"\u7DAD\u57FA\u5A92\u9AD4\u5206\u985E"@zh-mo .

$ zcat part-01023-c261bb68-4091-4613-ae52-88ce97d22c14-c000.txt.gz | head 
-10

<http://www.wikidata.org/entity/statement/Q99896811-7623BB4C-2D20-4D2E-8784-E2ED8AD3E8E5>
 <http://wikiba.se/ontology#rank> <http://wikiba.se/ontology#NormalRank> .

<http://www.wikidata.org/entity/statement/Q99896811-7623BB4C-2D20-4D2E-8784-E2ED8AD3E8E5>
 <http://www.wikidata.org/prop/statement/P31> 
<http://www.wikidata.org/entity/Q4167836> .

<http://www.wikidata.org/entity/statement/Q99896811-7623BB4C-2D20-4D2E-8784-E2ED8AD3E8E5>
 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 
<http://wikiba.se/ontology#BestRank> .

<https://ar.wikipedia.org/wiki/%D8%AA%D8%B5%D9%86%D9%8A%D9%81:%D8%B4%D8%B1%D9%83%D8%A7%D8%AA_%D8%B3%D9%88%D9%8A%D8%B3%D8%B1%D9%8A%D8%A9_%D8%A3%D8%B3%D8%B3%D8%AA_%D9%81%D9%8A_1973>
 <http://schema.org/about> <http://www.wikidata.org/entity/Q99896811> .

<https://ar.wikipedia.org/wiki/%D8%AA%D8%B5%D9%86%D9%8A%D9%81:%D8%B4%D8%B1%D9%83%D8%A7%D8%AA_%D8%B3%D9%88%D9%8A%D8%B3%D8%B1%D9%8A%D8%A9_%D8%A3%D8%B3%D8%B3%D8%AA_%D9%81%D9%8A_1973>
 <http://schema.org/isPartOf> <https://ar.wikipedia.org/> .

<https://ar.wikipedia.org/wiki/%D8%AA%D8%B5%D9%86%D9%8A%D9%81:%D8%B4%D8%B1%D9%83%D8%A7%D8%AA_%D8%B3%D9%88%D9%8A%D8%B3%D8%B1%D9%8A%D8%A9_%D8%A3%D8%B3%D8%B3%D8%AA_%D9%81%D9%8A_1973>
 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Article> .

<https://ar.wikipedia.org/wiki/%D8%AA%D8%B5%D9%86%D9%8A%D9%81:%D8%B4%D8%B1%D9%83%D8%A7%D8%AA_%D8%B3%D9%88%D9%8A%D8%B3%D8%B1%D9%8A%D8%A9_%D8%A3%D8%B3%D8%B3%D8%AA_%D9%81%D9%8A_1973>
 <http://schema.org/inLanguage> "ar" .

<https://ar.wikipedia.org/wiki/%D8%AA%D8%B5%D9%86%D9%8A%D9%81:%D8%B4%D8%B1%D9%83%D8%A7%D8%AA_%D8%B3%D9%88%D9%8A%D8%B3%D8%B1%D9%8A%D8%A9_%D8%A3%D8%B3%D8%B3%D8%AA_%D9%81%D9%8A_1973>
 <http://schema.org/name> 
"\u062A\u0635\u0646\u064A\u0641:\u0634\u0631\u0643\u0627\u062A 
\u0633\u0648\u064A\u0633\u0631\u064A\u0629 \u0623\u0633\u0633\u062A 
\u0641\u064A 1973"@ar .

<https://en.wikipedia.org/wiki/Category:Swiss_companies_established_in_1973> 
<http://schema.org/inLanguage> "en" .

<https://en.wikipedia.org/wiki/Category:Swiss_companies_established_in_1973> 
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Article> .
  
  You'll notice that the files are partitioned by `context` and `subject`, and within a partition they're also sorted by `context` and `subject` (the `context` field isn't part of the output, though; one would get that from the source tables). So you may see, as in this example, things that are logically clustered together spanning from the end of one file to the beginning of the next partition in sequence.
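
  For intuition, here's a minimal spark-shell sketch of that partition-and-sort shape (illustrative only, not the actual rdf-spark-tools code; the input path is hypothetical and the columns are assumed to already hold serialized RDF terms):

import org.apache.spark.sql.functions._

val triples = spark.read.parquet("hdfs:///user/dr0ptp4kt/example_triples.parquet") // hypothetical input
triples
  .repartitionByRange(1024, $"context", $"subject")   // partition by (context, subject) ranges
  .sortWithinPartitions($"context", $"subject")       // keep related triples adjacent within a file
  .select(concat_ws(" ", $"subject", $"predicate", $"object", lit(".")).as("value")) // context not emitted
  .write.option("compression", "gzip").text("hdfs:///user/dr0ptp4kt/example_nt_out")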

TASK DETAIL
  https://phabricator.wikimedia.org/T350106

EMAIL PREFERENCES
  https://phabricator.

[Wikidata-bugs] [Maniphest] T350106: Implement a spark job that converts a RDF triples table into a RDF file format

2023-12-04 Thread dr0ptp4kt
dr0ptp4kt added a subscriber: RKemper.
dr0ptp4kt added a comment.


  I ran the current version of the code as follows:
  
spark3-submit --master yarn --driver-memory 16G --executor-memory 12G 
--executor-cores 4 --conf spark.driver.cores=2 --conf 
spark.executor.memoryOverhead=4g --conf spark.sql.shuffle.partitions=512 --conf 
spark.dynamicAllocation.maxExecutors=128 --conf 
spark.sql.autoBroadcastJoinThreshold=-1 --conf spark.yarn.maxAppAttempts=1 
--class 
org.wikidata.query.rdf.spark.transform.structureddata.dumps.NTripleGenerator 
--name wikibase-rdf-statements-spark 
~dr0ptp4kt/rdf-spark-tools-0.3.138-SNAPSHOT-jar-with-dependencies.jar 
--input-table-partition-spec 
discovery.wikibase_rdf_scholarly_split/snapshot=20231016/wiki=wikidata/scope=wikidata_main
 --output-hdfs-path hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main 
--num-partitions 1024
  
  
  
spark3-submit --master yarn --driver-memory 16G --executor-memory 12G 
--executor-cores 4 --conf spark.driver.cores=2 --conf 
spark.executor.memoryOverhead=4g --conf spark.sql.shuffle.partitions=512 --conf 
spark.dynamicAllocation.maxExecutors=128 --conf 
spark.sql.autoBroadcastJoinThreshold=-1 --conf spark.yarn.maxAppAttempts=1 
--class 
org.wikidata.query.rdf.spark.transform.structureddata.dumps.NTripleGenerator 
--name wikibase-rdf-statements-spark 
~dr0ptp4kt/rdf-spark-tools-0.3.138-SNAPSHOT-jar-with-dependencies.jar 
--input-table-partition-spec 
discovery.wikibase_rdf_scholarly_split/snapshot=20231016/wiki=wikidata/scope=scholarly_articles
 --output-hdfs-path hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_schol 
--num-partitions 1024
  
  And updated the permissions.
  
hdfs dfs -chgrp -R analytics-search-users 
hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main
  
  
  
hdfs dfs -chgrp -R analytics-search-users 
hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_schol
  
  From stat1006 it is possible to use the already present `hdfs-rsync` (a script fronting a Java utility) to copy the produced files, like this:
  
hdfs-rsync -r hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_schol/ 
file:/destination/to/nt_wd_schol_gzips/
  
  
  
hdfs-rsync -r hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main/ 
file:/destination/to/nt_wd_main_gzips/
  
  Note: each directory has 1,024 files of 100 MB +/- a certain number of MB. 
The Spark routine randomly samples the data before sorting into partitions, and 
although all partitions have data, there's mild skew so the files aren't all 
exactly the same number of records.
  
  @bking / @RKemper / @dcausse / I will discuss more this week.
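
  As a quick sanity check on that mild skew, here's a small spark-shell sketch using the Hadoop FileSystem API (already on the classpath) to summarize the per-file sizes in one of the output directories:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val sizesMb = fs.listStatus(new Path("hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main"))
  .filter(s => s.isFile && s.getPath.getName.endsWith(".gz"))   // ignore _SUCCESS and friends
  .map(_.getLen.toDouble / (1024 * 1024))
println(f"files=${sizesMb.length} minMB=${sizesMb.min}%.1f maxMB=${sizesMb.max}%.1f meanMB=${sizesMb.sum / sizesMb.length}%.1f")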

TASK DETAIL
  https://phabricator.wikimedia.org/T350106

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: RKemper, EBernhardson, Aklapper, BTullis, bking, dr0ptp4kt, JAllemandou, 
dcausse, Danny_Benjafield_WMDE, Isabelladantes1983, Themindcoder, Adamm71, 
Jersione, Hellket777, LisafBia6531, Astuthiodit_1, AWesterinen, 786, Biggs657, 
karapayneWMDE, Invadibot, maantietaja, Juan90264, Alter-paule, Beast1978, 
ItamarWMDE, Un1tY, Akuckartz, Hook696, Kent7301, joker88john, CucyNoiD, 
Nandana, Namenlos314, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, 
Bsandipan, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Neuronton, Scott_WUaS, 
Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, 
Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T350106: Implement a spark job that converts a RDF triples table into a RDF file format

2023-12-04 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Not using right now, but here's roughly how one might go about generating 
more expanded Turtle statements without reverse-mapping prefixes: F41561068 
<https://phabricator.wikimedia.org/F41561068>

TASK DETAIL
  https://phabricator.wikimedia.org/T350106

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: EBernhardson, Aklapper, BTullis, bking, dr0ptp4kt, JAllemandou, dcausse, 
Danny_Benjafield_WMDE, Isabelladantes1983, Themindcoder, Adamm71, Jersione, 
Hellket777, LisafBia6531, Astuthiodit_1, AWesterinen, 786, Biggs657, 
karapayneWMDE, Invadibot, maantietaja, Juan90264, Alter-paule, Beast1978, 
ItamarWMDE, Un1tY, Akuckartz, Hook696, Kent7301, joker88john, CucyNoiD, 
Nandana, Namenlos314, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, 
Bsandipan, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Neuronton, Scott_WUaS, 
Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, 
Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T350106: Implement a spark job that converts a RDF triples table into a RDF file format

2023-11-29 Thread dr0ptp4kt
dr0ptp4kt added a subscriber: EBernhardson.
dr0ptp4kt added a comment.


  Adding a note so I don't forget: advice from @BTullis is to avoid NFS if possible, and advice from @JAllemandou is to consider use of `hdfs-rsync` (after our call I sought this out and found these: 
https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/python/refinery/hdfs.py
 and 
https://gerrit.wikimedia.org/g/analytics/hdfs-tools/deploy/+/2445aec92f6b3d409531fb74ab3f9a22d9716823/bin/hdfs-rsync
 and 
https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/bin/hdfs-rsync
 ). Chances are we'd need to add a ferm rule and possibly wire up some Kerberos stuff on the WDQS servers if going the hdfs-rsync route.
  
  During a Meet today, @EBernhardson, the group, and I discussed possible use of a mechanism similar to 
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/search/shared/transfer_to_es.py?ref_type=heads#L74-83
 and 
https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/blob/main/mjolnir/kafka/bulk_daemon.py?ref_type=heads
 , where a file is moved to Swift via Airflow and Mjolnir client code listens for Kafka events carrying the URLs from which to fetch the produced files (I haven't read this code closely yet, just parroting what I think I heard).
  
  We'll likely need to do these data transfers more than once, so it'll be good to have some level of automation support.

TASK DETAIL
  https://phabricator.wikimedia.org/T350106

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: EBernhardson, Aklapper, BTullis, bking, dr0ptp4kt, JAllemandou, dcausse, 
Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, karapayneWMDE, Invadibot, 
maantietaja, ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T350106: Implement a spark job that converts a RDF triples table into a RDF file format

2023-11-29 Thread dr0ptp4kt
dr0ptp4kt claimed this task.

TASK DETAIL
  https://phabricator.wikimedia.org/T350106

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: Aklapper, BTullis, bking, dr0ptp4kt, JAllemandou, dcausse, 
Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, karapayneWMDE, Invadibot, 
maantietaja, ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules

2023-11-20 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  The job completed. The counts match up on this productionized job compared 
with the prior one run in my namespace. Following are some Hive queries in case 
needed later. Below that is a really small sample of the resultant data in 
tabular format for each partition.
  
  **Counts**
  
select count(1) from discovery.wikibase_rdf_scholarly_split where snapshot 
= '20231016' and wiki = 'wikidata' and
scope = 'scholarly_articles';
7643858365
  
  
  
select count(1) from discovery.wikibase_rdf_scholarly_split where snapshot 
= '20231016' and wiki = 'wikidata' and
scope = 'wikidata_main';
7677112695
  
  **Samples**
  
  Note: because the target sample size is so small, it's actually possible to get slightly fewer than the target number of records due to sparseness in a randomly selected set. One can compensate by setting the numerator higher or the denominator lower to reduce the chance of such artifacts (e.g., to avoid getting 27 records when one really wants 30; below we get 30 records apiece, mind you) - a small sketch of this follows the query below. Note the horizontal scrollbars at the bottom of the tabular data in case the tables overflow on one's browser settings in Phabricator (mine do).
  
select "| " || concat_ws(" | ", subject, predicate, object, context) from 
discovery.wikibase_rdf_scholarly_split where snapshot = '20231016' and wiki = 
'wikidata' and
scope = 'scholarly_articles' and rand() <= (30/7643858365) distribute by 
rand() sort by rand() limit 30;
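
  (The sketch mentioned above, from spark-shell: oversample by roughly 2x and then trim, so a sparse draw is much less likely to come up short; same table and partition values as the query above.)

val sample = spark.sql("""
  select subject, predicate, object, context
  from discovery.wikibase_rdf_scholarly_split
  where snapshot = '20231016' and wiki = 'wikidata' and scope = 'scholarly_articles'
    and rand() <= (60 / 7643858365)
""").limit(30)
sample.show(30, truncate = false)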
  
  {icon graduation-cap spin}
  
  | subject | predicate | object | context |
  | --- | --- | --- | --- |
  | http://www.wikidata.org/entity/statement/Q114851466-BB650063-6818-4AF5-88FD-743A5520811C | http://www.w3.org/ns/prov#wasDerivedFrom | http://www.wikidata.org/reference/a84e44b8b704dd021b87b792549c1623fc1edff3 | http://www.wikidata.org/entity/Q114851466 |
  | http://www.wikidata.org/entity/statement/Q73327727-EE2DF999-D668-4D6A-860F-B5FE8B93747E | http://wikiba.se/ontology#rank | http://wikiba.se/ontology#NormalRank | http://www.wikidata.org/entity/Q73327727 |
  | http://www.wikidata.org/entity/Q45987415 | http://www.wikidata.org/prop/direct/P407 | http://www.wikidata.org/entity/Q1860 | http://www.wikidata.org/entity/Q45987415 |
  | http://www.wikidata.org/entity/statement/Q44327803-9B2ED327-7B22-41B3-927D-F0D780F14C63 | http://wikiba.se/ontology#rank | http://wikiba.se/ontology#NormalRank | http://www.wikidata.org/entity/Q44327803 |
  | http://www.wikidata.org/entity/Q40775359 | http://schema.org/description | "\u043D\u0430\u0443\u0447\u043D\u0430\u044F \u0441\u0442\u0430\u0442\u044C\u044F"@ru | http://www.wikidata.org/entity/Q40775359 |
  | http://www.wikidata.org/entity/statement/Q33904556-172A1324-DF02-4555-AC23-CD26DED1A182 | http://www.wikidata.org/prop/statement/P304 | "49-52" | http://www.wikidata.org/entity/Q33904556 |
  | http://www.wikidata.org/entity/Q21994578 | http://schema.org/description | "wetenschappelijk artikel (gepubliceerd op 2009/10/09)"@nl | http://www.wikidata.org/entity/Q21994578 |
  | http://www.wikidata.org/entity/statement/Q93701619-747BA9CD-B887-4755-A744-01607FD15567 | http://wikiba.se/ontology#rank | http://wikiba.se/ontology#NormalRank | http://www.wikidata.org/entity/Q93701619 |
  | http://www.wikidata.org/entity/statement/Q42812060-1DBF45B2-E920-4CF6-8011-A94820FF10EA | http://wikiba.se/ontology#rank | http://wikiba.se/ontology#NormalRank | http://www.wikidata.org/entity/Q42812060 |
  | http://www.wikidata.org/entity/statement/Q36819529-349D4DA8-BC3D-4B01-90F4-C5D42F4E3683 | http://www.wikidata.org/prop/statement/P50 | http://www.wikidata.org/entity/Q58034888 | |

[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules

2023-11-20 Thread dr0ptp4kt
dr0ptp4kt closed this task as "Resolved".
dr0ptp4kt triaged this task as "High" priority.

TASK DETAIL
  https://phabricator.wikimedia.org/T347989

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: EBernhardson, bking, dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, 
Isabelladantes1983, Themindcoder, Adamm71, Jersione, Hellket777, LisafBia6531, 
Astuthiodit_1, AWesterinen, 786, Biggs657, karapayneWMDE, Invadibot, 
maantietaja, Juan90264, Alter-paule, Beast1978, ItamarWMDE, Un1tY, Akuckartz, 
Hook696, Kent7301, joker88john, CucyNoiD, Nandana, Namenlos314, Gaboe420, 
Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, Lucas_Werkmeister_WMDE, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Lewizho99, Maathavan, 
_jensen, rosalieper, Neuronton, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T337013: [Epic] Splitting the graph in WDQS

2023-11-20 Thread dr0ptp4kt
dr0ptp4kt closed subtask T347989: Adapt rdf-spark-tools to split the wikidata 
graph based on a set of rules as Resolved.

TASK DETAIL
  https://phabricator.wikimedia.org/T337013

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, RKemper, bking, tfmorris, elal, karapayneWMDE, Aklapper, 
Lydia_Pintscher, me, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, 
BeautifulBold, Suran38, Invadibot, maantietaja, Peteosx1x, NavinRizwi, 
ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Dinoguy1000, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules

2023-11-16 Thread dr0ptp4kt
dr0ptp4kt added a subscriber: EBernhardson.
dr0ptp4kt added a comment.


  Spark patch merged, new Jenkins build of the rdf JAR done, Airflow patch 
merged. This is deployed to Search's Airflow instance and the job is running. 
Thank you, @dcausse and @EBernhardson.
  
  Here are the artifact location and parameters for the job that's currently running.
  
--deploy-mode cluster 
hdfs:///wmf/cache/artifacts/airflow/search/rdf-spark-tools-0.3.137-jar-with-dependencies.jar
--input-table-partition-spec 
discovery.wikibase_rdf_t337013/date=20231016/wiki=wikidata
--output-table-partition-spec 
discovery.wikibase_rdf_scholarly_split/snapshot=20231016/wiki=wikidata
max_attempts: 1

TASK DETAIL
  https://phabricator.wikimedia.org/T347989

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: EBernhardson, bking, dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, 
Isabelladantes1983, Themindcoder, Adamm71, Jersione, Hellket777, LisafBia6531, 
Astuthiodit_1, AWesterinen, 786, Biggs657, karapayneWMDE, Invadibot, 
maantietaja, Juan90264, Alter-paule, Beast1978, ItamarWMDE, Un1tY, Akuckartz, 
Hook696, Kent7301, joker88john, CucyNoiD, Nandana, Namenlos314, Gaboe420, 
Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, Lucas_Werkmeister_WMDE, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Lewizho99, Maathavan, 
_jensen, rosalieper, Neuronton, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules

2023-11-15 Thread dr0ptp4kt
dr0ptp4kt moved this task from In Progress to Needs review on the 
Discovery-Search (Current work) board.
dr0ptp4kt added a comment.


  Here's what I saw after re-running. So, we should be good with the latest 
patchset that goes without distinct() on the final graphs.
  
  Without distinct() on final graphs - 1h48m
  [dr0ptp4kt.wikibase_rdf_scholarly_split_refactor_no_distinct_less_cache]
  scholarly_articles: 7_643_858_365, wikidata_main: 7_677_112_695
  
  With distinct() on final graphs - 1h55m
  [dr0ptp4kt.wikibase_rdf_scholarly_split_refactor_using_distinct_less_cache]
  scholarly_articles: 7_643_858_365, wikidata_main: 7_677_112_695

TASK DETAIL
  https://phabricator.wikimedia.org/T347989

WORKBOARD
  https://phabricator.wikimedia.org/project/board/1227/

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: bking, dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, 
Isabelladantes1983, Themindcoder, Adamm71, Jersione, Hellket777, LisafBia6531, 
Astuthiodit_1, AWesterinen, 786, Biggs657, karapayneWMDE, Invadibot, 
maantietaja, Juan90264, Alter-paule, Beast1978, ItamarWMDE, Un1tY, Akuckartz, 
Hook696, Kent7301, joker88john, CucyNoiD, Nandana, Namenlos314, Gaboe420, 
Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, Lucas_Werkmeister_WMDE, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Lewizho99, Maathavan, 
_jensen, rosalieper, Neuronton, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules

2023-10-26 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Update: it seems to be working. I'd say this is maybe 75% complete.
  
  It takes about 1h40m to run and generate the two different partitions.
  
  WIP/Draft patches posted at 
https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/969229 and ^ . They 
require some refactoring and introduction of tests, and probably some extra 
config variables - I'll connect with Joseph about that last part.
  
  David, Erik, and I spoke through things earlier today while I opened the 
repos in my IDE. I'll request code review so I can iterate on this.

TASK DETAIL
  https://phabricator.wikimedia.org/T347989

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: bking, dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, 
Isabelladantes1983, Themindcoder, Adamm71, Jersione, Hellket777, LisafBia6531, 
Astuthiodit_1, AWesterinen, 786, Biggs657, karapayneWMDE, Invadibot, 
maantietaja, Juan90264, Alter-paule, Beast1978, ItamarWMDE, Un1tY, Akuckartz, 
Hook696, Kent7301, joker88john, CucyNoiD, Nandana, Namenlos314, Gaboe420, 
Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, Lucas_Werkmeister_WMDE, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Lewizho99, Maathavan, 
_jensen, rosalieper, Neuronton, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-25 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  I also see 
https://grafana.wikimedia.org/d/00264/wikidata-dump-downloads?orgId=1=5m=now-2y=now
 which I noticed from some tickets involving @Addshore (cf. T280678: Crunch and delete many old dumps logs <https://phabricator.wikimedia.org/T280678> and friends) and from a pointer from a colleague.
  
  As I noted, there are some complications around the 200s, and I see, from T280678 <https://phabricator.wikimedia.org/T280678>'s pointer to the source processing at 
https://github.com/wikimedia/analytics-wmde-scripts/blob/master/src/wikidata/dumpDownloads.php#L12
 , consideration for 206s and 200s. Future TODO in case we want to figure out how to deal with the different-sized 200s and the apparent downloader utilities.

TASK DETAIL
  https://phabricator.wikimedia.org/T347605

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: bking, dr0ptp4kt
Cc: jochemla, Addshore, dr0ptp4kt, Aklapper, bking, Danny_Benjafield_WMDE, 
Astuthiodit_1, AWesterinen, BTullis, karapayneWMDE, Invadibot, maantietaja, 
ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-25 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  ^ Update.

TASK DETAIL
  https://phabricator.wikimedia.org/T347605

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: bking, dr0ptp4kt
Cc: jochemla, Addshore, dr0ptp4kt, Aklapper, bking, Danny_Benjafield_WMDE, 
Astuthiodit_1, AWesterinen, BTullis, karapayneWMDE, Invadibot, maantietaja, 
ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-25 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Looking at yesterday's downloads with a rudimentary grep, we're not far from 1K downloads, and that's just for the //latest-all// ones. That also doesn't consider mirrors.
  
stat1007:~$ zgrep wikidatawiki 
/srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20231024.gz | grep 
latest-all | grep " 200 " | wc -l
  
  Now, it's good to keep in mind that some of these downloads are mirror jobs 
themselves, but looking at some of the source IPs it's clear that a good number 
of them also are not mirrors.

TASK DETAIL
  https://phabricator.wikimedia.org/T347605

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: bking, dr0ptp4kt
Cc: jochemla, Addshore, dr0ptp4kt, Aklapper, bking, Danny_Benjafield_WMDE, 
Astuthiodit_1, AWesterinen, BTullis, karapayneWMDE, Invadibot, maantietaja, 
ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules

2023-10-20 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  It took about 26min 24s to write `S_direct_triples` (7_293_925_470 rows) in basic Parquet. That's not all the rows (not even for its own partition, as that will also include Value and Reference triples), but it means the job ought to be able to write the total ~15B rows in about an hour of wall time (maybe double that to play it safe).

TASK DETAIL
  https://phabricator.wikimedia.org/T347989

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: bking, dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
AWesterinen, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, 
QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, 
Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules

2023-10-20 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  TL;DR this is about 45% done.
  
  This week I was working to address non-performant, often hanging or crashy, 
Spark runs. Last night I managed to get this running better, producing a 
reduction (the equivalent of `val_triples_only_used_by_sas` from 
https://people.wikimedia.org/~andrewtavis-wmde/T342111_spark_sa_subgraph_metrics.html
 ) in 8 minutes in one pass - instead of 3 hours or, worse, something longer 
followed by an indefinite hang or crash.
  
  The key here was a couple of things. First, higher resource limits (this seems obvious, but isn't always true) and attempting to prevent Spark from doing broadcast joins (judging by the Spark web UI's DAGs it still tries to plan them, but at least it doesn't seem to do them at bad times).
  
"spark.driver.memory": "16g",
"spark.driver.cores": 2,
"spark.executor.memory": "12g",
"spark.executor.cores": 4,
"spark.executor.memoryOverhead": "4g",
"spark.sql.shuffle.partitions": 512,
'spark.dynamicAllocation.maxExecutors': 128,
'spark.locality.wait': '1s', # test 0
'spark.sql.autoBroadcastJoinThreshold': -1
  
  Second, removal of `cache()` calls and setting some join tables up as their own DataFrames. In practice this likely means more disk-based merge behavior on the executors for huge joins, but it works better. I'm interested in exploring bucketing as an optimization strategy, but may forgo it for production of the table as it doesn't seem necessary at the moment - it may, however, be useful for the produced table for people doing further join operations, so I'm thinking about this.
  
  I had the small reduction pushing to a Parquet directory in HDFS last night. 
I will be working to see how performant and reliable pushing a larger data set 
is and will report back here. From there I'll port from Python to Scala.
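
  For reference, a short spark-shell sketch of those two knobs (table and column names are illustrative, not the production job): turning off broadcast-join selection for the huge joins, and bucketing a produced table so later joins on the same key can skip the shuffle:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)   // never pick broadcast joins

val split = spark.table("dr0ptp4kt.wikibase_rdf_with_split") // illustrative source table
split.write
  .bucketBy(512, "subject")   // co-locate rows by join key
  .sortBy("subject")
  .mode("overwrite")
  .saveAsTable("dr0ptp4kt.wikibase_rdf_split_bucketed")      // hypothetical bucketed copy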

TASK DETAIL
  https://phabricator.wikimedia.org/T347989

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: bking, dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
AWesterinen, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, 
QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, 
Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-16 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Good question - I meant the contrast between the .ttl.gz dumps, with everything that goes into munging and importing them (in aggregate across all downloaders of those files), versus doing the same with the .jnl, where downloaders don't have to munge and import. Napkin-mathsing it, the thought was that the energy savings accrue roughly as soon as the 16 cores x 12 hours of compression time for the .jnl has been "saved" by people in aggregate not needing to run the import process (I'm waving away the client-side decompression, which technically happens twice for the .ttl.gz user but only once for the .jnl.zst user, and any other disk or network transfer pieces, as those are all close enough, I suppose).
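
  To make the napkin math concrete, a toy sketch (the per-downloader import cost is an assumption for illustration, not a measurement):

val compressionCoreHours = 16 * 12          // one-time cost to compress the .jnl
val importCoreHoursPerUser = 16.0 * 10      // assumed munge+import cost per downloader (hypothetical)
val breakEvenDownloads = compressionCoreHours / importCoreHoursPerUser
println(f"compression pays for itself after ~$breakEvenDownloads%.1f skipped imports")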
  
  I'll go check on what stats may be readily available on dumps downloads.
  
  Good point on having a checksum and timestamp. Yeah, it would be nice to have 
it in an on-demand place without the need for extra data transfer!

TASK DETAIL
  https://phabricator.wikimedia.org/T347605

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: bking, dr0ptp4kt
Cc: Addshore, dr0ptp4kt, Aklapper, bking, Danny_Benjafield_WMDE, Astuthiodit_1, 
AWesterinen, BTullis, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, 
Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules

2023-10-13 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Personalized dev environment on analytics cluster with Airflow setup 
(stat1006) - was able to execute job, slightly hacked up to get specific date 
and not keep running regularly (eats lots of disk) to get 
`dr0ptp4kt.wikibase_rdf_with_split` using my Kerberos principal. Verifying 
Jupyter notebook approach from David / Andy on stat1005 - some glitches as to 
be expected, but worked okay by doubling timeouts and removing some caps. Next 
up, working on a job that will do the splitting in a fashion similar to what's 
achieved with the join-antijoin approach of the notebooks. I'll want to have 
the produced data separated out from the existing table, I think - in this case 
it would be okay in my opinion to use some extra disk.
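
  For the record, a minimal spark-shell sketch of that join/anti-join shape (illustrative only - the real rules live in the rdf-spark-tools patch; `scholarly_entity_ids` here is a hypothetical one-column table of scholarly-article entity IRIs):

val all = spark.table("dr0ptp4kt.wikibase_rdf_with_split")
val scholarlyIds = spark.table("dr0ptp4kt.scholarly_entity_ids")   // hypothetical
// left_semi keeps triples whose context is a scholarly article; left_anti keeps the rest
val scholarly = all.join(scholarlyIds, all("context") === scholarlyIds("entity"), "left_semi")
val main = all.join(scholarlyIds, all("context") === scholarlyIds("entity"), "left_anti")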

TASK DETAIL
  https://phabricator.wikimedia.org/T347989

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: bking, dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
AWesterinen, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, 
QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, 
Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T344905: Publish WDQS JNL files to dumps.wikimedia.org

2023-10-13 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  > I think the amount of time taken to decompress the JNL file should also be taken into consideration on varying hardware if compression is being considered.
  
  Closing the loop, posted my experience at T347605#9229608 
<https://phabricator.wikimedia.org/T347605#9229608>.

TASK DETAIL
  https://phabricator.wikimedia.org/T344905

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: xcollazo, bking, Krinkle, dr0ptp4kt, Abbe98, Gehel, Addshore, Aklapper, 
Danny_Benjafield_WMDE, Mohamed-Awnallah, Astuthiodit_1, AWesterinen, lbowmaker, 
BTullis, karapayneWMDE, Invadibot, Ywats0ns, maantietaja, ItamarWMDE, 
Akuckartz, WDoranWMF, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-13 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  @bking just wanted to express my gratitude for the support on this ticket and 
its friends T344905: Publish WDQS JNL files to dumps.wikimedia.org 
<https://phabricator.wikimedia.org/T344905> and T347647: 2023-09-18 
latest-all.ttl.gz WDQS dump `Fatal error munging RDF 
org.openrdf.rio.RDFParseException: Expected '.', found 'g'` 
<https://phabricator.wikimedia.org/T347647>. FWIW I do think it would be good 
to automate this. As a matter of getting to a functional WDQS local environment 
replete with BlazeGraph data, it would accelerate things a lot. I think my only 
reservations are that:
  
  1. It takes time to automate. Any rough guess on level of effort for that? I 
understand that'd inform relative prioritization against the large pile of 
other things.
  2. The energy savings are possibly unclear, at least under the current setup (but that's partly because it's hard to know how much energy is being expended, which could be guessed at from the number of dump downloads; not sure how easy it is to get those stats; this is different from the bandwidth transfer on Cloudflare R2).
  
  However, I would probably err on the side of assuming that ultimately the 
automation will boost the technical communities' interest and ability to trial 
things locally (right now the barriers are somewhat prohibitive) and that the 
energy savings will roughly net out - ironically, if it attracts more people, 
they'll in the aggregate consume more energy, but they'll also be vastly more 
efficient energy-wise because they won't have to ETL, which takes a lot of 
compute resources. For potential reusers (e.g., Enterprise or other 
institutions) it might help smooth things along a bit, although this is mostly 
just my conjecture.
  
  Thinking ahead a little, we'd probably want to generalize anything so that it 
can take arbitrary `.jnl`s, for example for split graphs.

TASK DETAIL
  https://phabricator.wikimedia.org/T347605

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: bking, dr0ptp4kt
Cc: Addshore, dr0ptp4kt, Aklapper, bking, Danny_Benjafield_WMDE, Astuthiodit_1, 
AWesterinen, BTullis, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, 
Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-05 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Addressing @Addshore's comment in T344905#9210122 
<https://phabricator.wikimedia.org/T344905#9210122>...
  
  > I think the amount of time taken to decompress the JNL file should also be 
taken into consideration on varying hardware if compression is being considered.
  
  Here's what I saw for performance:
  
/mnt/x $ time unzstd --output-dir-flat /mnt/y/ wikidata.jnl.zst
wikidata.jnl.zst: 1265888788480 bytes

real219m10.733s
user29m51.350s
sys 12m53.425s
  
  This was on an i7-8700 CPU @ 3.20GHz. When I checked with `top` it seemed to be using about 0.8-1.6 processors, hovering around 1 processor, at any given time. `unzstd` doesn't support multi-processor decompression from what I can see.

TASK DETAIL
  https://phabricator.wikimedia.org/T347605

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: Addshore, dr0ptp4kt, Aklapper, bking, Danny_Benjafield_WMDE, Astuthiodit_1, 
AWesterinen, BTullis, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, 
Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-05 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Drawing from your inspiration, I downloaded with `wget` overnight and the `sha1sum` now matches that from `wdqs1016`. Decompressing now; will update with results.

TASK DETAIL
  https://phabricator.wikimedia.org/T347605

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: Addshore, dr0ptp4kt, Aklapper, bking, Danny_Benjafield_WMDE, Astuthiodit_1, 
AWesterinen, BTullis, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, 
Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`

2023-10-03 Thread dr0ptp4kt
dr0ptp4kt closed this task as "Resolved".
dr0ptp4kt added a comment.


  I'm going to close this for now given that the later dump munged okay and there seems to be an underlying issue somewhere, probably related to file transfer. The `-- --skolemize` flag will be a thing to consider for any future run, nonetheless.

TASK DETAIL
  https://phabricator.wikimedia.org/T347647

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: Aklapper, Stang, dr0ptp4kt, bking, Danny_Benjafield_WMDE, Mohamed-Awnallah, 
Astuthiodit_1, AWesterinen, lbowmaker, BTullis, karapayneWMDE, Invadibot, 
Ywats0ns, maantietaja, ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`

2023-10-03 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  I did manage to run a `sha1sum` on the older dump where the import had failed.
  
/mnt/w$ time sha1sum latest-all.ttl.gz
dedad5a589b3a3661a1f9ebb7f1a6bcbce1b4ef2  latest-all.ttl.gz

real28m47.000s
user3m21.104s
sys 0m46.825s

$ ls -al latest-all.ttl.gz
-rwxrwxrwx 1 adam adam 129294028486 Sep 27 05:35 latest-all.ttl.gz
  
  It seems like there was data corruption somewhere in the transfer, in persistence to disk, or post-download. I don't see this `sha1sum` anywhere. It's conceivable something went wrong during the course of the `sha1sum`s themselves, but I'm not going to spend more time on this. Just wanted to document it for future selves. One remark: normally, one would expect the download to fail if the corruption were in the transfer itself.

TASK DETAIL
  https://phabricator.wikimedia.org/T347647

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: Aklapper, Stang, dr0ptp4kt, bking, Danny_Benjafield_WMDE, Mohamed-Awnallah, 
Astuthiodit_1, AWesterinen, lbowmaker, BTullis, karapayneWMDE, Invadibot, 
Ywats0ns, maantietaja, ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-03 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Here's the `sha1sum` for the latest file I had downloaded:
  
/mnt/x$ time sha1sum wikidata.jnl.zst
62327feb2c6ad5b352b5abfe9f0a4d3ccbeebfab  wikidata.jnl.zst

real77m16.215s
user8m39.726s
sys 2m42.932s

TASK DETAIL
  https://phabricator.wikimedia.org/T347605

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: Addshore, dr0ptp4kt, Aklapper, bking, Danny_Benjafield_WMDE, Astuthiodit_1, 
AWesterinen, BTullis, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, 
Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules

2023-10-03 Thread dr0ptp4kt
dr0ptp4kt claimed this task.

TASK DETAIL
  https://phabricator.wikimedia.org/T347989

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, 
AWesterinen, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, 
QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, 
Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-03 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  For me the first 300 GB of the file went really, really fast. But `axel` was dropping connections, similar to when I had downloaded the large 1 TB file, so this download took about 5 hours. I'm pretty sure it could be done in 1-3 hours, though, if everything were working well.
  
  Now, I encountered an error, and it was reproducible with two separate downloads. @bking, does a test on the file yield the same "corrupted block detected" warning for you by any chance if you download the zst? What about if you do it with your already existing copy?
  
/mnt/x $ time unzstd --output-dir-flat /mnt/y/ wikidata.jnl.zst
wikidata.jnl.zst : 649266 MB... wikidata.jnl.zst : Decoding error 
(36) : Corrupted block detected

real124m59.115s
user17m44.647s
sys 7m24.509s

/mnt/x $ ls -l wikidata.jnl.zst
-rwxrwxrwx 1 adam adam 342189138219 Oct  3 02:32 wikidata.jnl.zst
  
  I've kicked off a `sha1sum`, but this will take a while to run.
  
/mnt/x$ time sha1sum wikidata.jnl.zst

TASK DETAIL
  https://phabricator.wikimedia.org/T347605

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: Addshore, dr0ptp4kt, Aklapper, bking, Danny_Benjafield_WMDE, Astuthiodit_1, 
AWesterinen, BTullis, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, 
Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, 
GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`

2023-10-02 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  The addshore .jnl (August file) does launch nicely with `./runBlazegraph.sh`.
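
  For anyone reproducing this: assuming the stock wdqs layout, where 
`./runBlazegraph.sh` picks up `RWStore.properties` from the same directory, 
pointing the service at the downloaded journal should only need one line 
there (worth double-checking the property name in your copy):

com.bigdata.journal.AbstractJournal.file=/path/to/wikidata.jnl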

TASK DETAIL
  https://phabricator.wikimedia.org/T347647

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: Aklapper, Stang, dr0ptp4kt, bking, Danny_Benjafield_WMDE, Mohamed-Awnallah, 
Astuthiodit_1, AWesterinen, lbowmaker, BTullis, karapayneWMDE, Invadibot, 
Ywats0ns, maantietaja, ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`

2023-09-30 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  The addshore .jnl (August file) download completed, using the Linux tool 
`axel`. Working from memory as I checked on the download over my 1 Gbps 
connection: the first 800 or so GB downloaded in the first 3-4 hours, then 
(as some Cloudflare connections seemed to fall off) the remaining 400 or so 
GB took another 18 hours, so the total download time was about 22 hours. 
Next will be to verify that it loads cleanly.
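
  In case it helps the documentation effort: since `axel` keeps a `.st` 
state file and resumes a partial download when re-invoked, a simple retry 
loop papers over the dropped connections. The URL and connection count below 
are placeholders, not the exact values used here:

URL="https://example.org/path/to/wikidata.jnl.zst"   # placeholder URL
until axel -n 8 -o wikidata.jnl.zst "$URL"; do
    echo "axel exited before finishing; retrying in 30s" >&2
    sleep 30
done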

TASK DETAIL
  https://phabricator.wikimedia.org/T347647

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: Aklapper, Stang, dr0ptp4kt, bking, Danny_Benjafield_WMDE, Astuthiodit_1, 
AWesterinen, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, 
QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, 
Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`

2023-09-29 Thread dr0ptp4kt
dr0ptp4kt added a comment.


  Update - the newer dump munged without any problems.
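
  For anyone retrying the munge step: the wdqs distribution ships a 
`munge.sh` wrapper around the munger, which takes the dump file and an 
output directory roughly as below; the exact flags may differ between 
versions, so treat this as a sketch and double-check against the script 
itself.

./munge.sh -f latest-all.ttl.gz -d ./munged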

TASK DETAIL
  https://phabricator.wikimedia.org/T347647

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dr0ptp4kt
Cc: Aklapper, Stang, dr0ptp4kt, bking, Danny_Benjafield_WMDE, Astuthiodit_1, 
AWesterinen, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, 
QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, 
Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


  1   2   >