[Wikidata-bugs] [Maniphest] T215413: Image Classification Research and Development
dr0ptp4kt removed a project: Reading-Admin. TASK DETAIL https://phabricator.wikimedia.org/T215413 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Miriam, dr0ptp4kt Cc: dr0ptp4kt, fkaelin, AikoChou, Capankajsmilyo, Mholloway, Ottomata, Jheald, Cirdan, MoritzMuehlenhoff, CDanis, akosiaris, SandraF_WMF, Fuzheado, PDrouin-WMF, Krenair, d.astrikov, JoeWalsh, Nirzar, dcausse, fgiunchedi, JAllemandou, leila, Capt_Swing, mpopov, Nuria, DarTar, Halfak, Gilles, EBernhardson, MusikAnimal, Abit, elukey, diego, Cparle, Ramsey-WMF, Miriam, Isaac, me, Danny_Benjafield_WMDE, Mohamed-Awnallah, S8321414, KinneretG, Astuthiodit_1, YLiou_WMF, BeautifulBold, EChetty, lbowmaker, Suran38, BTullis, karapayneWMDE, Invadibot, GFontenelle_WMF, Ywats0ns, maantietaja, FRomeo_WMF, Peteosx1x, NavinRizwi, ItamarWMDE, Nintendofan885, Akuckartz, Dringsim, 4748kitoko, Nandana, JKSTNK, Akovalyov, Abdeaitali, Lahi, Gq86, E1presidente, GoranSMilovanovic, QZanden, EBjune, KimKelting, Tramullas, Acer, V4switch, LawExplorer, Salgo60, Avner, Silverfish, _jensen, rosalieper, Scott_WUaS, Susannaanas, Wong128hk, Jane023, terrrydactyl, Wikidata-bugs, Base, matthiasmullie, aude, Daniel_Mietchen, Dinoguy1000, Ricordisamoa, Wesalius, Lydia_Pintscher, Fabrice_Florin, Raymond, Steinsplitter, Matanya, Mbch331, jeremyb ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T123349: EPIC: Article placeholders using wikidata
dr0ptp4kt removed a project: Reading-Admin. TASK DETAIL https://phabricator.wikimedia.org/T123349
[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs
dr0ptp4kt closed this task as "Resolved". dr0ptp4kt added a comment. I just added a link to https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update#See_also . Marking this ticket as resolved after noticing it was still open. TASK DETAIL https://phabricator.wikimedia.org/T355037
[Wikidata-bugs] [Maniphest] T352538: [EPIC] Evaluate the impact of the graph split
dr0ptp4kt closed subtask T355037: Compare the performance of sparql queries between the full graph and the subgraphs as Resolved. TASK DETAIL https://phabricator.wikimedia.org/T352538
[Wikidata-bugs] [Maniphest] T363721: Show "small logo or icon" as fallback image in search
dr0ptp4kt edited projects, added Wikidata; removed Discovery-Search (Current work). TASK DETAIL https://phabricator.wikimedia.org/T363721
[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)
dr0ptp4kt closed this task as "Resolved". dr0ptp4kt claimed this task. dr0ptp4kt added a comment. Thanks @RKemper ! These speed gains are welcome news. We should discuss in an upcoming meeting whether any further actions are needed. I can see how we may want to set the bufferCapacity to 100 for imports, while continuing to run with a bufferCapacity of 10 once a node is in serving mode; a good topic for discussion. TASK DETAIL https://phabricator.wikimedia.org/T362920
[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)
dr0ptp4kt added a comment. Mirroring comment in T359062#9783010 <https://phabricator.wikimedia.org/T359062#9783010>: > And for the second run in T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors) <https://phabricator.wikimedia.org/T362920> we saw that this took about 3089 minutes, or about 2.15 days, for the scholarly article entity graph with the CPU governor change (described in T336443#9726600 <https://phabricator.wikimedia.org/T336443#9726600>) plus the bufferCapacity at 100 on wdqs2023. TASK DETAIL https://phabricator.wikimedia.org/T362920
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment. On the gaming-class 2018 desktop, although the `bufferCapacity` value of 100 sped things up as described on this ticket, applying the CPU governor change did not seem to have any additional effect (the import took 2.47 days, compared to the previous record of 2.44). It's possible that the desktop's existing BIOS configuration (already set to a high performance mode) was already extracting optimal performance, or that something about the processor architecture's interaction with the rest of the hardware and operating system simply differs from the data center server. In any case, it's nice to see that the data center server is faster! One theory is that the desktop's 64GB of total RAM plays some role; the hardware vendor has indicated that although more memory can be installed, the machine will only run with 64GB RAM and can't jump to 128GB RAM. Another is that the default memory swappiness (60) on the desktop could play a role. I find this less likely, as memory spikes haven't been a problem on this machine while loading data, and since the drive is an NVMe, paging is less likely to manifest problematically anyway. Maybe something to check another day; we generally use a swappiness of 0 in the data center, as on the WDQS hosts.
TASK DETAIL https://phabricator.wikimedia.org/T359062
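The swappiness and governor settings discussed above can be inspected from the standard Linux procfs/sysfs interfaces; a minimal sketch (these are generic kernel paths, nothing WDQS-specific, and the commented commands are only an assumption about how one would mirror the data center settings):

```shell
# Current swappiness (the desktop above reportedly runs the default of 60;
# WDQS data center hosts use 0).
cat /proc/sys/vm/swappiness

# Active CPU frequency governor per core (may not be exposed in VMs/containers).
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor 2>/dev/null | sort | uniq -c || true

# To mirror the data center settings (root required; not persistent across reboots):
# sudo sysctl vm.swappiness=0
# for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
#     echo performance | sudo tee "$g" >/dev/null
# done
```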
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment. And for the second run in T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors) <https://phabricator.wikimedia.org/T362920> we saw that this took about 3089 minutes, or about 2.15 days, for the scholarly article entity graph with the CPU governor change (described in T336443#9726600 <https://phabricator.wikimedia.org/T336443#9726600>) plus the bufferCapacity at 100 on wdqs2023. TASK DETAIL https://phabricator.wikimedia.org/T359062
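As a sanity check on the minutes-to-days conversions quoted in these benchmark comments (plain arithmetic, not project code; the figures are taken from the comments in this thread):

```python
MINUTES_PER_DAY = 24 * 60  # 1440

def minutes_to_days(minutes: float) -> float:
    """Convert a wall-clock duration in minutes to days."""
    return minutes / MINUTES_PER_DAY

# Second run (governor change plus bufferCapacity 100): ~3089 minutes.
print(round(minutes_to_days(3089), 2))  # → 2.15
# First run (governor change alone): ~3702 minutes.
print(round(minutes_to_days(3702), 2))  # → 2.57
```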
[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)
dr0ptp4kt added a comment. In T362920#9776418 <https://phabricator.wikimedia.org/T362920#9776418>, @RKemper wrote: > @dr0ptp4kt >> we saw that this took about 3702 minutes, or about 2.57 //hours// > Typo you'll want to fix here and in the original: 2.57 **days** I think this is what is referred to as wishful thinking! Okay, updated the comment in the other ticket and in the comment above. TASK DETAIL https://phabricator.wikimedia.org/T362920
[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)
dr0ptp4kt added a comment. Mirroring comment in T359062#9775908 <https://phabricator.wikimedia.org/T359062#9775908>: > In T362920 <https://phabricator.wikimedia.org/T362920>: Benchmark Blazegraph import with increased buffer capacity (and other factors) we saw that this took about 3702 minutes, or about 2.57 days, for the scholarly article entity graph with the CPU governor change (described in T336443#9726600 <https://phabricator.wikimedia.org/T336443#9726600>) alone on wdqs2023. The count matches T359062#9695544 <https://phabricator.wikimedia.org/T359062#9695544>.
select (count(*) as ?ct) where {?s ?p ?o}
7643858078
TASK DETAIL https://phabricator.wikimedia.org/T362920
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment. In T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors) <https://phabricator.wikimedia.org/T362920> we saw that this took about 3702 minutes, or about 2.57 days, for the scholarly article entity graph with the CPU governor change (described in T336443#9726600 <https://phabricator.wikimedia.org/T336443#9726600>) alone on wdqs2023. TASK DETAIL https://phabricator.wikimedia.org/T359062
[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)
dr0ptp4kt added a comment. Another thing that can be helpful for later analysis is to add some timing and a simple log file. A command like the following was useful when I was trying this out on the gaming-class desktop (you may not need this if your tmux session lets you scroll back far enough, but it's also handy for tailing even without tmux).
date | tee loadData.log
time ./loadData.sh -n wdq -d /mnt/firehose/split_0/nt_wd_schol -s 0 -e 0 2>&1 | tee -a loadData.log
time ./loadData.sh -n wdq -d /mnt/firehose/split_0/nt_wd_schol 2>&1 | tee -a loadData.log
TASK DETAIL https://phabricator.wikimedia.org/T362920
[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)
dr0ptp4kt added a comment. @RKemper I think that's captured in P54284 <https://phabricator.wikimedia.org/P54284> . If you need to get a copy of the files, there's a pointer in T350106#9381611 <https://phabricator.wikimedia.org/T350106#9381611> on how one might copy from HDFS to the local filesystem, and the rest of that ticket covers the data transfer. I kept a copy of the files at `stat1006:/home/dr0ptp4kt/gzips/nt_wd_schol`, so those should be ready to be copied over if that helps. TASK DETAIL https://phabricator.wikimedia.org/T362920
[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)
dr0ptp4kt added a project: Wikidata. TASK DETAIL https://phabricator.wikimedia.org/T362920
[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)
dr0ptp4kt renamed this task from "Benchmark Blazegraph import with increased buffer capacity" to "Benchmark Blazegraph import with increased buffer capacity (and other factors)". TASK DETAIL https://phabricator.wikimedia.org/T362920
[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity
dr0ptp4kt created this task. dr0ptp4kt added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION In T359062: Assess Wikidata dump import hardware <https://phabricator.wikimedia.org/T359062> there's compelling evidence that increasing the buffer capacity for import, that is, updating RWStore.properties <https://gerrit.wikimedia.org/g/operations/puppet/+/3038e3b156c743c986d2f032a9810272138da9e2/modules/query_service/templates/RWStore.common.properties.erb#26> to a value of `com.bigdata.rdf.sail.bufferCapacity=100`, leads to a material performance improvement, as observed on a gaming-class desktop. This task is a request to verify this soon on a WDQS node in the data center, preferably ahead of any further imports with changed graph split definitions. At this point it seems clear that CPU speed, disk speed, and the buffer capacity each make a meaningful difference in import time. Proposed: Using the `scholarly_articles` split files, on wdqs2024, run imports as follows.
1. With the CPU performance governor configuration applied as described in T336443#9726600 <https://phabricator.wikimedia.org/T336443#9726600> and with the existing default `RWStore.properties` configuration (which will have `com.bigdata.rdf.sail.bufferCapacity=10`, note this is 100_000). This will let us better understand, for the R450 <https://phabricator.wikimedia.org/diffusion/EPRO/> setup, whether the performance benefits of the governor configuration (roughly an analog of a faster processor, as seen with the gaming-class desktop) extend to this bulk ingestion routine. We could compare against results from T350465#9405888 <https://phabricator.wikimedia.org/T350465#9405888> .
2. Then, still with the CPU performance governor configuration in place, using an RWStore.properties with a value of `com.bigdata.rdf.sail.bufferCapacity=100` (note this is 1_000_000). This will let us verify that for this hardware class the performance benefits extend further.
3. If and when a high speed NVMe is installed in wdqs2024 (T361216), with both the CPU performance governor and the higher buffer capacity in place. This will let us verify that for this hardware class the performance benefits extend further still.
We had used wdqs1024 for the main graph ("non-scholarly") import before; note the request here is to do the scholarly article graph import on wdqs2024, mainly because we have an NVMe request in flight for it. TASK DETAIL https://phabricator.wikimedia.org/T362920
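For illustration, configurations 1 and 2 in the task description differ by a single line of RWStore.properties; a sketch of the relevant fragment (values exactly as given above, using the template's own shorthand; this is a config fragment, not the full file):

```properties
# Item 1: default shipped in the Puppet template (shorthand 10, i.e. 100_000):
com.bigdata.rdf.sail.bufferCapacity=10

# Item 2: proposed for the bulk-import benchmark (shorthand 100, i.e. 1_000_000):
# com.bigdata.rdf.sail.bufferCapacity=100
```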
[Wikidata-bugs] [Maniphest] T362060: Generalize ScholarlyArticleSplitter
dr0ptp4kt added a comment. **Running time** Total Uptime: 55 min This was faster than in T347989#9335980 <https://phabricator.wikimedia.org/T347989#9335980>. Nice! **Counts** To be discussed in code review. **Samples** These look similar to what we'd expect based on T347989#9346038 <https://phabricator.wikimedia.org/T347989#9346038> .
select "| " || concat_ws(" | ", subject, predicate, object, context) from dr0ptp4kt.wikibase_rdf_scholarly_split_t362060 where snapshot = '20231016' and wiki = 'wikidata' and scope = 'scholarly_articles' and rand() <= (30/7643858365) distribute by rand() sort by rand() limit 30;
| subject | predicate | object | context |
| --- | --- | --- | --- |
| http://www.wikidata.org/entity/statement/Q46815762-E3F8B9BE-32CC-4055-9097-0732A1D7E88E | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://wikiba.se/ontology#BestRank | http://www.wikidata.org/entity/Q46815762 |
| http://www.wikidata.org/reference/c2c805e274b6709d71ffd08402ed14a95ddc0f48 | http://www.wikidata.org/prop/reference/P248 | http://www.wikidata.org/entity/Q180686 | http://wikiba.se/ontology#Reference |
| http://www.wikidata.org/entity/Q93646519 | http://schema.org/description | "1985\u5E74\u306E\u8AD6\u6587"@ja | http://www.wikidata.org/entity/Q93646519 |
| http://www.wikidata.org/entity/Q82929879 | http://wikiba.se/ontology#sitelinks | "0"^^http://www.w3.org/2001/XMLSchema#integer | http://www.wikidata.org/entity/Q82929879 |
| http://www.wikidata.org/reference/698fdc9c32c9033280837148dd0cc2fbb09701b6 | http://www.wikidata.org/prop/reference/P248 | http://www.wikidata.org/entity/Q229883 | http://wikiba.se/ontology#Reference |
| http://www.wikidata.org/entity/statement/Q37398018-08548343-257C-43E8-8768-1B82B012B857 | http://www.w3.org/ns/prov#wasDerivedFrom | http://www.wikidata.org/reference/1312ec06258ac7841e5e97d5b1d85cc034da666b | http://www.wikidata.org/entity/Q37398018 |
| http://www.wikidata.org/entity/statement/Q38261165-38825DC4-B1CA-4102-8CCE-2B4713882EED | http://wikiba.se/ontology#rank | http://wikiba.se/ontology#NormalRank | http://www.wikidata.org/entity/Q38261165 |
| http://www.wikidata.org/entity/statement/Q50247650-2B75A590-C865-4CD7-8E93-C5720E77B459 | http://www.wikidata.org/prop/statement/P31 | http://www.wikidata.org/entity/Q13442814 | http://www.wikidata.org/entity/Q50247650 |
| http://www.wikidata.org/entity/statement/Q56638632-3EEB814A-C402-48D4-9577-B91996287EDD | http://wikiba.se/ontology#rank | http://wikiba.se/ontology#NormalRank | http://www.wikidata.org/entity/Q56638632 |
| http://www.wikidata.org/entity/statement/Q93198245-A9EF6F3A-AE60-4B68-9ADF-03861F92E7D2 | http://www.w3.org/ns/prov#wasDerivedFrom | http://www.wikidata.org/reference/c40456cccbdf1b0dbf4590fad9ace45a270e3af6 | http://www.wikidata.org/entity/Q93198245 |
| http://www.wikidata.org/entity/statement/Q35798201-73FA43B1-DE81-4AB8-84A1-435A776AFBF8 | http://www.wikidata.org/prop/statement/P50 | http://www.wikidata.org/entity/Q55071316 | http://www.wikidata.org/entity/Q35798201 |
| http://www.wikidata.org/entity/statement/Q46675214-E205C68E-FD35-4F3B-99F6-CEF31C772C1E | http://www.wikidata.org/prop/qualifier/P1545 | "2" | http://www.wikidata.org/entity/Q46675214 |
| http://www.wikidata.org/entity/statement/Q40608211-C59EE5EA-2F96-47C2-AE41-7EBEB83583F5 | http://wikiba.se/ontology#rank | http://wikiba.se/ontology#NormalRank | http://www.wikidata.org/entity/Q40608211 |
| http://ww
[Wikidata-bugs] [Maniphest] T362060: Generalize ScholarlyArticleSplitter
dr0ptp4kt added a comment. I kicked off a run using the current version of the patch with the following command and backing table; its status can be followed here: https://yarn.wikimedia.org/cluster/app/application_1713178047802_16409 So long as I haven't made an error somewhere in here that produces a runtime exception (e.g., pathing), we should be able to see after a couple of hours how it's going.
spark3-submit --master yarn --driver-cores 2 \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  --conf spark.dynamicAllocation.maxExecutors=128 \
  --conf spark.sql.shuffle.partitions=512 \
  --conf spark.executor.memoryOverhead=4g \
  --executor-cores 4 --executor-memory 12g --driver-memory 16g \
  --name scholarly_article_split_manual__scholarly_article_split_triples__T362060_personal_namespace \
  --conf spark.yarn.maxAppAttempts=1 \
  --class org.wikidata.query.rdf.spark.transform.structureddata.dumps.ScholarlyArticleSplit \
  --deploy-mode cluster \
  /home/dr0ptp4kt/rdf-spark-tools-0.3.138-SNAPSHOT-jar-with-dependencies-T362060.jar \
  --input-table-partition-spec discovery.wikibase_rdf_t337013/date=20231016/wiki=wikidata \
  --output-table-partition-spec dr0ptp4kt.wikibase_rdf_scholarly_split_T362060/snapshot=20231016/wiki=wikidata
Here was the manual table creation I did while `use`ing the `dr0ptp4kt` namespace.
CREATE TABLE IF NOT EXISTS dr0ptp4kt.wikibase_rdf_scholarly_split_T362060 (
  `subject` string,
  `predicate` string,
  `object` string,
  `context` string
)
PARTITIONED BY (
  `snapshot` string,
  `wiki` string,
  `scope` string
)
STORED AS PARQUET
LOCATION 'hdfs://analytics-hadoop/user/dr0ptp4kt/wikibase_rdf_scholarly_split_T362060/wikidata/rdf_scholarly_split_T362060/';
TASK DETAIL https://phabricator.wikimedia.org/T362060
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment. Good news. With the N-triples style scholarly entity graph files, a buffer capacity of 100, a write retention queue capacity of 4000, and a heap size of 31g, the import took about 2.40 days on the gaming-class desktop. Recall that with a buffer capacity of 10 it took about 3.25 days on this desktop (and, again, 5.875 days on wdqs1024). So the higher buffer capacity yielded about a 35% (1.35 minus 1) speed increase on this gaming-class desktop. It appears then that the combination of faster CPU, NVMe, and a higher buffer capacity is somewhere around 145% (5.875 / 2.40 ≈ 2.45; 2.45 minus 1 = 1.45) faster than what we observed on a target data center machine. The gain will likely be somewhat less dramatic on 10B triples if the previous munged file runs are any clue. I'm going to think about how to check this notion; it could be done by using the scholarly graph plus a portion of the main graph, which would probably be close enough for our purposes. A high speed NVMe is being acquired so that we can verify on wdqs2024 the level of speedup achieved on a server similar to what was used for the graph split test servers. wdqs2024 currently has a hardware profile similar to wdqs1024.
Some output from the terminal from the import on the gaming-class desktop (the Blazegraph HTML response wrapper is elided):
ubuntu22:~$ head -9 ~/rdf/dist/target/service-0.3.138-SNAPSHOT/loadData.log
Sun Apr 7 12:03:19 PM CDT 2024
Processing part-0-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz
totalElapsed=64069ms, elapsed=64024ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=71897ms, commitTime=1712509470732, mutationCount=7349689
Sun Apr 7 12:04:31 PM CDT 2024
Processing part-1-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz
# screen output at the end:
Processing part-01023-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz
totalElapsed=51703ms, elapsed=51703ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=181013ms, commitTime=1712716306763, mutationCount=7946575
Tue Apr 9 09:31:50 PM CDT 2024
File /mnt/firehose/split_0/nt_wd_schol/part-01024-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz not found, terminating
real 3447m18.542s
TASK DETAIL https://phabricator.wikimedia.org/T359062
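The relative speedups above are simple ratios; a quick arithmetic check using the durations reported in this thread (plain Python, not project code):

```python
def speedup_pct(slow_days: float, fast_days: float) -> float:
    """Percentage speed increase going from slow_days to fast_days."""
    return (slow_days / fast_days - 1) * 100

# Gaming-class desktop: bufferCapacity 10 (3.25 days) vs 100 (2.40 days).
print(round(speedup_pct(3.25, 2.40)))      # → 35
# wdqs1024 baseline (5.875 days) vs desktop with all changes (2.40 days).
print(round(speedup_pct(5.875, 2.40), 1))  # → 144.8
```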
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment. Update: With the buffer capacity at 100, file number 550 of the scholarly graph was imported as of `Mon Apr 8 03:22:08 PM CDT 2024`. So, under 28 hours so far (with the buffer capacity at 10 it was more than 36 hours).
Processing part-00550-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz
totalElapsed=51018ms, elapsed=51018ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=245278ms, commitTime=1712607725882, mutationCount=7414497
Mon Apr 8 03:22:08 PM CDT 2024
Will update when it completes. TASK DETAIL https://phabricator.wikimedia.org/T359062
[Wikidata-bugs] [Maniphest] T361246: scap deploy should not repool a wdqs node that is depooled
dr0ptp4kt added a project: Discovery-Search (Current work).
[Wikidata-bugs] [Maniphest] T361935: Adapt the WDQS Streaming Updater to update multiple WDQS subgraphs
dr0ptp4kt added a project: Discovery-Search (Current work).
[Wikidata-bugs] [Maniphest] T361950: Ensure that WDQS query throttling does not interfere with federation
dr0ptp4kt added a project: Discovery-Search (Current work).
[Wikidata-bugs] [Maniphest] T362060: Generalize ScholarlyArticleSplitter
dr0ptp4kt added a project: Discovery-Search (Current work).
[Wikidata-bugs] [Maniphest] T361114: Alert Search Platform and/or DPE SRE when Wikidata is lagged
dr0ptp4kt set the point value for this task to "2".
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment.

With bufferCapacity at 10**0**, kicked it off again with the scholarly article entity graph files:

ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ date | tee loadData.log; time ./loadData.sh -n wdq -d /mnt/firehose/split_0/nt_wd_schol -s 0 -e 0 2>&1 | tee -a loadData.log; time ./loadData.sh -n wdq -d /mnt/firehose/split_0/nt_wd_schol 2>&1 | tee -a loadData.log
Sun Apr 7 12:03:19 PM CDT 2024
Processing part-0-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment.

Update. On the gaming-class machine it took about 3.25 days to import the scholarly article entity graph using a buffer capacity of 10 (compare this with 5.875 days on wdqs1024 <https://phabricator.wikimedia.org/T350465#9405888>). This resulted in 7_643_858_078 triples, as expected. Next up will be a run with a buffer capacity of 10**0** to see if there is any obvious difference in import time.

Sun Apr 7 03:34:59 AM CDT 2024
Processing part-01023-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz
blazegraph by SYSTAP
totalElapsed=181901ms, elapsed=181901ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=226511ms, commitTime=1712479122009, mutationCount=7946575
Sun Apr 7 03:38:46 AM CDT 2024
File /mnt/firehose/split_0/nt_wd_schol/part-01024-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz not found, terminating

real    4684m49.905s
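For comparison's sake, 3.25 days on the desktop versus 5.875 days on wdqs1024 amounts to roughly a 45% reduction in wall-clock time:

```shell
# Wall-clock comparison: gaming desktop vs wdqs1024 for the same graph side.
desktop_days=3.25
wdqs1024_days=5.875
pct=$(awk -v a="$desktop_days" -v b="$wdqs1024_days" \
  'BEGIN { printf "%.0f", (1 - a / b) * 100 }')
echo "about ${pct}% less wall-clock time"
```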
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment.

Just an update on how far along this run is: file 550 of the scholarly article entity side of the graph is being processed. There are files 0 through 1023 for this side of the graph. Note that I did remember to `tee` the output this time around, so there should be more information available for reviewing output, stack traces (hopefully there are none), and so on, should it be needed.

Processing part-00549-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz
blazegraph by SYSTAP
totalElapsed=299675ms, elapsed=299675ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=392531ms, commitTime=1712329890306, mutationCount=7032172
Fri Apr 5 10:11:32 AM CDT 2024
Processing part-00550-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz

Sidebar: the "non"-scholarly article entity graph also has files 0-1023 and is similarly sized in terms of triples, but the way its nodes are interconnected naturally differs because of the types of entities involved, the kinds of data those entities carry, and so on.
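With the output `tee`'d to loadData.log, per-file commit stats can be pulled back out of lines like the one above. A small sed sketch (field names as they appear in the Blazegraph response):

```shell
# Extract COMMIT timing and mutation count from a Blazegraph load log line.
line='COMMIT: totalElapsed=392531ms, commitTime=1712329890306, mutationCount=7032172'
elapsed=$(printf '%s\n' "$line" | sed -n 's/.*totalElapsed=\([0-9]*\)ms.*/\1/p')
mutations=$(printf '%s\n' "$line" | sed -n 's/.*mutationCount=\([0-9]*\).*/\1/p')
echo "commit took ${elapsed}ms for ${mutations} mutations"
```

Run over the whole log, this gives a quick per-file timing series without re-scrolling `screen`.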
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment.

Following roughly the procedure in P54284 <https://phabricator.wikimedia.org/P54284> to rename the Spark-produced graph files (and updating `loadData.sh` with `FORMAT=part-%05d-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz`, still with a `date` call after each `curl` in it), I kicked off an import of the scholarly article entity graph like so, to see how it goes with a buffer capacity of 10:

ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ date; time ./loadData.sh -n wdq -d /mnt/firehose/split_0/nt_wd_schol -s 0 -e 0 2>&1 | tee loadData.log; time ./loadData.sh -n wdq -d /mnt/firehose/split_0/nt_wd_schol 2>&1 | tee -a loadData.log
Wed Apr 3 09:32:54 PM CDT 2024
Processing part-0-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz
blazegraph by SYSTAP
totalElapsed=55629ms, elapsed=55584ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=61598ms, commitTime=1712198035155, mutationCount=7349689
Wed Apr 3 09:33:56 PM CDT 2024

real    1m1.702s
user    0m0.004s
sys     0m0.006s

Processing part-1-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz
blazegraph by SYSTAP
totalElapsed=61251ms, elapsed=61251ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=71925ms, commitTime=1712198106800, mutationCount=7774048
Wed Apr 3 09:35:08 PM CDT 2024
Processing part-2-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz

This is with the following values in `RWStore.properties`:

com.bigdata.btree.writeRetentionQueue.capacity=4000
com.bigdata.rdf.sail.bufferCapacity=10

and the following variable in `loadData.sh`:

HEAP_SIZE=${HEAP_SIZE:-"31g"}
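As a sanity check on the `FORMAT` change: assuming `loadData.sh` substitutes the file counter printf-style (which the `%05d` in the pattern suggests), each file number maps to a zero-padded part filename like so:

```shell
# How the %05d in FORMAT maps a file number to a renamed Spark part filename.
FORMAT="part-%05d-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz"
name=$(printf "$FORMAT" 550)
echo "$name"
```

This matches the `Processing part-00550-...` lines seen in the log output.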
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment.

This morning of April 3, around 6:25 AM, I had SSH'd in to check progress, and it was working, but going slowly, similar to the day before. It was on a file number in the 1200s, but I didn't write down the number or copy terminal output; I do remember seeing that one of the files was taking around 796 seconds at that time. Looking at the previous comment, you'll see those files were going slowly too; not surprising, as we know imports of these munged files get slower as more data is imported. I checked several hours later, in the middle of a meeting, and it had gone into a bad spiral. I've been able to use `screen` backscrolling to obtain much of the stack trace, but could not backscroll far enough to tell for sure which file was the last to import without a stack trace. What we can say is that //probably// the last somewhat stable commit was on file 1302 at about 7:24 AM. Probably file 1303, and definitely 1304 and 1305, have been failing badly, and taking a really long time to do so; this would probably continue indefinitely from here without killing the process. Just a slice of the paste here to give an idea of things (notice `lastCommitTime` and `commitCounter` in the stack trace).

Wed Apr 3 02:05:26 PM CDT 2024
Processing wikidump-01305.ttl.gz
SPARQL-UPDATE: updateStr=LOAD
java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: Problem with entry at -83289912769511002: lastRootBlock=rootBlock{ rootBlock=0, challisField=1302, version=3, nextOffset=47806576684846562, localTime=1712147044389 [Wednesday, April 3, 2024 7:24:04 AM CDT], firstCommitTime=1711737574896 [Friday, March 29, 2024 1:39:34 PM CDT], lastCommitTime=1712147041973 [Wednesday, April 3, 2024 7:24:01 AM CDT], commitCounter=1302, commitRecordAddr={off=NATIVE:-140859033,len=422}, commitRecordIndexAddr={off=NATIVE:-93467508,len=220}, blockSequence=34555, quorumToken=-1, metaBitsAddr=26754033649714513, metaStartAddr=11989126, storeType=RW, uuid=f993598d-497c-46a7-8434-d25c8859a0b8, offsetBits=42, checksum=1600335692, createTime=1711737574192 [Friday, March 29, 2024 1:39:34 PM CDT], closeTime=0}

Unfortunately `jstack` seems to hiccup.

ubuntu22:~$ sudo jstack -m 987870
[sudo] password:
Attaching to process ID 987870, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.402-b06
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at sun.tools.jstack.JStack.runJStackTool(JStack.java:140)
        at sun.tools.jstack.JStack.main(JStack.java:106)
Caused by: java.lang.RuntimeException: Unable to deduce type of thread from address 0x7fecb400b800 (expected type JavaThread, CompilerThread, ServiceThread, JvmtiAgentThread, or SurrogateLockerThread)
        at sun.jvm.hotspot.runtime.Threads.createJavaThreadWrapper(Threads.java:169)
        at sun.jvm.hotspot.runtime.Threads.first(Threads.java:153)
        at sun.jvm.hotspot.tools.PStack.initJFrameCache(PStack.java:200)
        at sun.jvm.hotspot.tools.PStack.run(PStack.java:71)
        at sun.jvm.hotspot.tools.PStack.run(PStack.java:58)
        at sun.jvm.hotspot.tools.PStack.run(PStack.java:53)
        at sun.jvm.hotspot.tools.JStack.run(JStack.java:66)
        at sun.jvm.hotspot.tools.Tool.startInternal(Tool.java:260)
        at sun.jvm.hotspot.tools.Tool.start(Tool.java:223)
        at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
        at sun.jvm.hotspot.tools.JStack.main(JStack.java:92)
        ... 6 more
Caused by: sun.jvm.hotspot.types.WrongTypeException: No suitable match for type of address 0x7fecb400b800
        at sun.jvm.hotspot.runtime.InstanceConstructor.newWrongTypeException(InstanceConstructor.java:62)
        at sun.jvm.hotspot.runtime.VirtualConstructor.instantiateWrapperFor(VirtualConstructor.java:80)
        at sun.jvm.hotspot.runtime.Threads.createJavaThreadWrapper(Threads.java:165)
        ... 16 more

ubuntu22:~$ sudo jstack -Flm 987870
Usage:
    jstack [-l] <pid>           (to connect to running process)
    jstack -F [-m] [-l] <pid>   (to connect to a hung process)
    jstack [-m] [-l] <core>     (to connect to a core file)
    jstack [-m] [-l] [server_id@]<remote server> (to connect to a remote debug server)

Options:
    -F  to force a thread dump. Use when jstack does not respond (process is hung)
    -m  to print both java and native frames (
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment.

Now this is interesting: we're past 4 days (about 4 days and 1 hour) of this running, and with the buffer capacity at 10 instead of 10**0** (but this time without any gap between the batches of files), there's still a good way to go yet.

Processing wikidump-01177.ttl.gz
blazegraph by SYSTAP
totalElapsed=612796ms, elapsed=612796ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=689208ms, commitTime=1712085811545, mutationCount=12297407
Tue Apr 2 02:23:35 PM CDT 2024
Processing wikidump-01178.ttl.gz
blazegraph by SYSTAP
totalElapsed=850122ms, elapsed=850121ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=950693ms, commitTime=1712086762086, mutationCount=16659867
Tue Apr 2 02:39:26 PM CDT 2024
Processing wikidump-01179.ttl.gz

It's possible this means that a higher buffer capacity actually makes a difference. I will let this run complete so we can see what the percentage difference is. After that, I will check whether this sort of behavior is reproducible, and to what extent, with one side of the graph split when using these two different buffer sizes.
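From the two commits quoted above, the effective commit throughput at this point in the run can be estimated as mutationCount over COMMIT totalElapsed:

```shell
# Mutations per second for the wikidump-01178 commit quoted above.
mutation_count=16659867
commit_ms=950693
rate=$(awk -v m="$mutation_count" -v ms="$commit_ms" \
  'BEGIN { printf "%.0f", m / (ms / 1000) }')
echo "about $rate mutations/second"
```

Tracking this rate across files makes the slowdown over the life of a run easy to see.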
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment.

The run with the buffer at 10**0**, heap size at 31g, and queue capacity at 4000 on the gaming-class desktop completed.

Processing wikidump-01332.ttl.gz
blazegraph by SYSTAP
totalElapsed=13580ms, elapsed=13580ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=266483ms, commitTime=1711304860167, mutationCount=4772590
Sun Mar 24 01:27:45 PM CDT 2024

real    5690m30.371s

...which is 3.95 days. I'm trying again, but going back to the buffer capacity of 10 instead of 10**0** for one last comparison with these runs on this subset of munged data, and without any longer pause between batches of files. (Remember, the previous run with buffer capacity at 10, a 31g heap, and queue capacity at 4000 was done by first running files 1-150 and then, after coming back to the terminal sometime later, resuming from file 151; but in the real world we usually hope to just let this thing run one file after another without any pause. In practice it could be that allowing the JVM time to heal itself created some artificial speed gains, but we'll see.) Starting on Friday, March 29, 2024 at 1:40 PM CT...

ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ time ./loadData.sh -n wdq -d /mnt/firehose/munge_on_later_data_set -s 1 -e 1332
Processing wikidump-1.ttl.gz

I'll update when it's done. It should complete presumably sometime in the next 24 hours.
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment.

**AWS EC2 servers**

After exploring a battery of EC2 servers, four instance types were selected and the posted commands were run. The configuration most like our `wdqs1021-1023` servers (third generation Intel Xeon) is listed first. The fastest of the four was a Graviton3 ARM-based configuration from Amazon.

| Time Disk ➡️ Disk | Time RAMdisk ➡️ RAMdisk | Instance Type | Cost Per Hour | HD Transfer | Processor Comment | RAM Comment |
| --- | --- | --- | --- | --- | --- | --- |
| 26m46.651s | 26m26.923s | m6id <https://aws.amazon.com/ec2/instance-types/m6i/>.16xlarge | $3.7968 | EBS ➡️ NVMe | 64 vCPU @ "Up to 3.5 GHz 3rd Generation Intel Xeon Scalable processors (Ice Lake 8375C)" | 256 GB @ DDR4 |
| 22m5.442s | 20m31.244s | m5zn <https://aws.amazon.com/ec2/instance-types/m5/>.metal | $3.9641 | EBS ➡️ EBS | 48 vCPU @ "2nd Generation Intel Xeon Scalable Processors (Cascade Lake 8252C) with an all-core turbo frequency up to 4.5 GHz" | 192 GiB @ DDR4 |
| 21m40.537s | 20m57.268s | c5d <https://aws.amazon.com/ec2/instance-types/c5/>.12xlarge | $2.304 | EBS ➡️ NVMe | 48 vCPU @ "C5 and C5d 12xlarge, 24xlarge, and metal instance sizes feature custom 2nd generation Intel Xeon Scalable Processors (Cascade Lake 8275CL) with a sustained all core Turbo frequency of 3.6GHz and single core turbo frequency of up to 3.9GHz." | 96 GiB @ DDR4 |
| 19m18.825s | 19m23.868s | c7gd <https://aws.amazon.com/ec2/instance-types/c7g/>.16xlarge | $2.903 | EBS ➡️ NVMe | 64 vCPU @ "Powered by custom-built AWS Graviton3 processors" | 128 GiB @ DDR5 |

**2018 gaming desktop**

The commands were then run against a gaming-class desktop from 2018. This outperformed the fastest Graviton3 configuration in AWS. The Blazegraph `bufferCapacity` configuration variable was also tested here: increasing `bufferCapacity` from 10 to 100 yielded a sizable performance improvement.

| Time Disk ➡️ Disk | Instance Type | bufferCapacity | HD Transfer | Processor Comment | RAM Comment |
| --- | --- | --- | --- | --- | --- |
| 18m31.647s | Alienware Aurora R7 <https://www.bestbuy.com/site/alienware-aurora-r7-gaming-desktop-intel-core-i7-8700-16gb-memory-nvidia-gtx-1070-1tb-hdd-intel-optane-memory/6155310.p?skuId=6155310> (upgraded) i7-8700 | 10 | SATA SSD ➡️ NVMe | 6 CPU @ up to 4.6 GHz (i7-8700 <https://ark.intel.com/content/www/us/en/ark/products/126686/intel-core-i7-8700-processor-12m-cache-up-to-4-60-ghz.html> page) | 64 GB @ DDR4 |
| 18m3.798s | Alienware Aurora R7 <https://www.bestbuy.com/site/alienware-aurora-r7-gaming-desktop-intel-core-i7-8700-16gb-memory-nvidia-gtx-1070-1tb-hdd-intel-optane-memory/6155310.p?skuId=6155310> (upgraded) i7-8700 | 10 | NVMe ➡️ same NVMe | 6
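Comparing the desktop's first-row disk➡️disk time (18m31.647s) against the fastest EC2 entry (c7gd.16xlarge at 19m18.825s), the 2018 desktop comes out a few percent ahead. Converting both to seconds:

```shell
# 2018 desktop (18m31.647s) vs c7gd.16xlarge (19m18.825s), disk-to-disk.
desktop_s=$(awk 'BEGIN { print 18 * 60 + 31.647 }')
c7gd_s=$(awk 'BEGIN { print 19 * 60 + 18.825 }')
pct=$(awk -v d="$desktop_s" -v c="$c7gd_s" 'BEGIN { printf "%.0f", (1 - d / c) * 100 }')
echo "desktop about ${pct}% faster"
```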
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment.

By the way, I'm attempting a run of the first 1332 munged files (one shy of the 1333 where the last run terminated) with the buffer at 10**0**, heap size at 31g, and queue capacity at 4000 on the gaming-class desktop, to see whether this imports smoothly and whether performance gains are noticeable.

ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ date
Wed Mar 20 02:36:59 PM CDT 2024
ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ time ./loadData.sh -n wdq -d /mnt/firehose/munge_on_later_data_set -s 1 -e 1332

...screen'ing in to check:

Processing wikidump-00505.ttl.gz
blazegraph by SYSTAP
totalElapsed=13452ms, elapsed=13452ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=167329ms, commitTime=1711041930967, mutationCount=4566497
Thu Mar 21 12:25:35 PM CDT 2024
Processing wikidump-00506.ttl.gz
blazegraph by SYSTAP
totalElapsed=15405ms, elapsed=15405ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=203202ms, commitTime=1711042135111, mutationCount=5262167
Thu Mar 21 12:28:58 PM CDT 2024
Processing wikidump-00507.ttl.gz
blazegraph by SYSTAP
totalElapsed=14701ms, elapsed=14700ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=178754ms, commitTime=1711042314114, mutationCount=5005853
Thu Mar 21 12:31:57 PM CDT 2024
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment. The run to check with heap size of 31g, queue capacity of 8000, and buffer at 10**0** stalled at file 107.
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment.

Attempting a run with a **queue capacity of 8000**, a buffer of 10**0**, and a heap size of 16g on the gaming-class desktop to mimic the MacBook Pro, things were slower than with a queue capacity of 4000, a buffer of 100, and a heap size of 31g on the gaming-class desktop <https://phabricator.wikimedia.org/T359062#9643972>.

ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ time ./loadData.sh -n wdq -d /mnt/firehose/munge_on_later_data_set -s 1 -e 150
...
real    280m46.264s

A run is in progress to see whether anything noticeable changes when the heap size is set to 31g but the queue capacity stays at 8000 and the buffer at 10**0**, processing the first 150 files on the gaming-class desktop.
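Comparing the 280m46.264s above (16g heap, queue 8000) against the 240m5.344s reported for the 31g-heap, queue-4000, buffer-100 run on the same 150 files gives the size of the slowdown:

```shell
# Relative slowdown of the 16g-heap/8000-queue run vs the 31g-heap/4000-queue run.
run_16g_s=$(awk 'BEGIN { print 280 * 60 + 46.264 }')
run_31g_s=$(awk 'BEGIN { print 240 * 60 + 5.344 }')
pct=$(awk -v a="$run_16g_s" -v b="$run_31g_s" 'BEGIN { printf "%.0f", (a / b - 1) * 100 }')
echo "about ${pct}% slower"
```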
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment.

**About Amazon Neptune**

Amazon Neptune was set to import using the simpler N-Triples file format, with its serverless configuration at 128 NCUs (about 256 GB of RAM with some attendant CPU). We don't use N-Triples files in our existing import process, but it is the sort of format used in the graph split imports.

curl -v -X POST \
  -H 'Content-Type: application/json' \
  https://db-neptune-1.cluster-cnim20k6c0mh.us-west-2.neptune.amazonaws.com:8182/loader -d '
  {
    "source" : "s3://blazegraphdump/latest-lexemes.nt.bz2",
    "format" : "ntriples",
    "iamRoleArn" : "arn:aws:iam::ACCOUNTID:role/NeptuneLoadFromS3",
    "region" : "us-west-2",
    "failOnError" : "FALSE",
    "parallelism" : "HIGH",
    "updateSingleCardinalityProperties" : "FALSE",
    "queueRequest" : "TRUE"
  }'

This required a number of grants; I had to make my personal bucket hosting the file listable and readable, as well as the objects listable and readable within it (it's possible to do chained IAM grants, but it is a bit of work and requires somewhat complicated STSes). It appeared that it was also necessary to create the VPC endpoint as described in the documentation. This was started at 1:30 PM CT on Monday, February 26, 2024. Note that this is the lexemes dump. I'm trying here to verify that with 128 NCUs it goes faster than with 32 NCUs, because if it does, that will be useful for the bigger dump.

curl -v -X POST \
  -H 'Content-Type: application/json' \
  https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader -d '
  {
    "source" : "s3://blazegraphdump/latest-lexemes.nt.bz2",
    "format" : "ntriples",
    "iamRoleArn" : "arn:aws:iam::ACCOUNTID:role/NeptuneLoadFromS3Attempt",
    "region" : "us-west-2",
    "failOnError" : "FALSE",
    "parallelism" : "OVERSUBSCRIBE",
    "updateSingleCardinalityProperties" : "FALSE",
    "queueRequest" : "TRUE"
  }'

{
  "status" : "200 OK",
  "payload" : {
    "loadId" : "8ace45ed-2989-4fd4-aa19-d13b9a59e824"
  }
}

curl -G 'https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader/8ace45ed-2989-4fd4-aa19-d13b9a59e824'

{
  "status" : "200 OK",
  "payload" : {
    "feedCount" : [ { "LOAD_COMPLETED" : 1 } ],
    "overallStatus" : {
      "fullUri" : "s3://blazegraphdump/latest-lexemes.nt.bz2",
      "runNumber" : 1,
      "retryNumber" : 0,
      "status" : "LOAD_COMPLETED",
      "totalTimeSpent" : 2142,
      "startTime" : 1708975752,
      "totalRecords" : 163715491,
      "totalDuplicates" : 141148,
      "parsingErrors" : 0,
      "datatypeMismatchErrors" : 0,
      "insertErrors" : 0
    }
  }
}

Now, for the full Wikidata load. This was started at about 2:20 PM CT on Monday, February 26, 2024.

curl -v -X POST \
  -H 'Content-Type: application/json' \
  https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader -d '
  {
    "source" : "s3://blazegraphdump/latest-all.nt.bz2",
    "format" : "ntriples",
    "iamRoleArn" : "arn:aws:iam::ACCOUNTID:role/NeptuneLoadFromS3Attempt",
    "region" : "us-west-2",
    "failOnError" : "FALSE",
    "parallelism" : "OVERSUBSCRIBE",
    "updateSingleCardinalityProperties" : "FALSE",
    "queueRequest" : "TRUE"
  }'

{
  "status" : "200 OK",
  "payload" : {
    "loadId" : "54dc9f5a-6e3c-428d-8897-180e10c96dbf"
  }
}

curl -G 'https://db-neptune-1-instance-1.cwnhpfsf87ne.us-west-2.neptune.amazonaws.com:8182/loader/54dc9f5a-6e3c-428d-8897-180e10c96dbf'

As a frame of reference, over 9B records imported in a bit over 26 hours. This is in the ball
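From the lexemes status payload above, load throughput is `totalRecords` over `totalTimeSpent` (the latter is reported in seconds):

```shell
# Neptune lexemes load: records per second from the loader status payload.
total_records=163715491
total_seconds=2142
rate=$(awk -v r="$total_records" -v s="$total_seconds" 'BEGIN { printf "%.0f", r / s }')
echo "about $rate records/second"
```

Roughly 76k records/second for the lexemes dump, which is the baseline to beat when comparing NCU sizes.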
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment.

**Going for the full import**

Further import commenced from there with a `bufferCapacity` of 10**0**:

ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ date
Mon Mar 4 06:31:06 PM CST 2024
ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ time ./loadData.sh -n wdq -d /mnt/firehose/munge_on_later_data_set -s 151 -e 2202
Processing wikidump-00151.ttl.gz

Munge files 151 through 1333 were processed, stopping at Friday, March 8, 2024, 12:07:23 AM CST. So, we have about 4 hours for files 1-150, then another 77.6 hours for files 151-1333. This means about 66% of the full dump was processed in about 3.5 days. As noted earlier, there may be an opportunity to set the queue capacity higher and squeeze out even better performance. This will need to wait until I'm physically at the gaming-class desktop.
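The 77.6-hour figure can be checked with GNU date arithmetic (timestamps taken from the run output, assuming CST throughout):

```shell
# Hours elapsed from the start of files 151+ to the stop time (requires GNU date).
start=$(date -d "2024-03-04 18:31:06 CST" +%s)
end=$(date -d "2024-03-08 00:07:23 CST" +%s)
hours=$(awk -v s="$start" -v e="$end" 'BEGIN { printf "%.1f", (e - s) / 3600 }')
echo "$hours hours"
```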
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment. **More about bufferCapacity**

Similarly, a run over the 150 munged files was attempted with the buffer in RWStore.properties increased from 10 to 10**0**, with the target as the NVMe.

com.bigdata.rdf.sail.bufferCapacity=100

ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ time ./loadData.sh -n wdq -d /mnt/firehose/munge_on_later_data_set -s 1 -e 150
...
real 240m5.344s

Remember, for //nine// munged files the difference in performance for NVMe ➡️ same NVMe between the `bufferCapacity` of 10 versus 10**0** was about 34% (~1.3412 minus 1), and what we see here for //150// munged files is somewhat consistent at about 33%. TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
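The ~33% figure can be reproduced from the two 150-file wall-clock times reported in these comments (240m5.344s with `bufferCapacity` at 100, and 319m50.828s for the corresponding `bufferCapacity`=10 NVMe run):

```python
# Minute/second values transcribed from the `time` output in the comments.
def to_minutes(m, s):
    return m + s / 60.0

t_buffer_10 = to_minutes(319, 50.828)   # real 319m50.828s (bufferCapacity=10)
t_buffer_100 = to_minutes(240, 5.344)   # real 240m5.344s (bufferCapacity=100)
speedup = t_buffer_10 / t_buffer_100
print(round((speedup - 1) * 100, 1))    # → 33.2 (percent improvement)
```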
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment. **More about NVMe versus SSD**

Runs were also done to see the effects on 150 munged files (out of a set of 2202 files) from the full Wikidata import, which allows for exercising more disk-related pieces. This was tried with both types of target disk - SATA SSD and M.2 NVMe - on the 2018 gaming desktop, with the `bufferCapacity` of 10. The M.2 NVMe was somewhere between 16% and 19% faster.

Notice the paths in the following commands:

- `~/rdf`, which is part of a mount on the NVMe
- `/mnt/t`, which is a copy of `~/rdf`, but on a SATA SSD
- `/mnt/firehose/`, yet another SATA SSD, bearing the full set of munged files

**Target is NVMe**

ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ time ./loadData.sh -n wdq -d /mnt/firehose/munge_on_later_data_set -s 1 -e 150
...
Processing wikidump-00150.ttl.gz
totalElapsed=33999ms, elapsed=33999ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=76005ms, commitTime=1709099819611, mutationCount=3098484

real 319m50.828s

**Target is SATA SSD, run attempt 1**

Now, the SATA SSD as the target (as before, the source has been a different SATA SSD).

ubuntu22:/mnt/t/rdf/dist/target/service-0.3.138-SNAPSHOT$ time ./loadData.sh -n wdq -d /mnt/firehose/munge_on_later_data_set -s 1 -e 150
...
Processing wikidump-00150.ttl.gz
totalElapsed=45665ms, elapsed=45665ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=114606ms, commitTime=1709141576293, mutationCount=3098484

real 381m19.703s

So, the SATA SSD as target yielded a result about 19% slower.
**Target is SATA SSD, run attempt 2**

The SATA SSD target was tried again from the same directory (as always, first stopping Blazegraph and deleting the journal), just to get a feeling for whether the first result was a fluke.

ubuntu22:/mnt/t/rdf/dist/target/service-0.3.138-SNAPSHOT$ time ./loadData.sh -n wdq -d /mnt/firehose/munge_on_later_data_set -s 1 -e 150
...
totalElapsed=46490ms, elapsed=46490ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=120472ms, commitTime=1709169683880, mutationCount=3098484

real 373m52.079s

Still, some 17% slower on the SSD. TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
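The slowdown percentages follow directly from the three wall-clock times in this comment (a quick check; times transcribed from the `time` output):

```python
# 150 munged files, bufferCapacity=10, times from the runs above.
nvme = 319 + 50.828 / 60      # real 319m50.828s (NVMe target)
ssd_run1 = 381 + 19.703 / 60  # real 381m19.703s (SATA SSD, attempt 1)
ssd_run2 = 373 + 52.079 / 60  # real 373m52.079s (SATA SSD, attempt 2)

print(round((ssd_run1 / nvme - 1) * 100, 1))  # → 19.2 (attempt 1 slowdown, %)
print(round((ssd_run2 / nvme - 1) * 100, 1))  # → 16.9 (attempt 2 slowdown, %)
```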
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a subscriber: ssingh. dr0ptp4kt added a comment.

@ssingh would you mind if the following command is run on one of the newer cp hosts with a newer, higher-write-throughput NVMe? If so, got a recommended node? I don't have access, but I think @bking may.

`sudo sync; sudo dd if=/dev/zero of=tempfile bs=25M count=1024; sudo sync`

Heads up, I'm out for the rest of the day. TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: ssingh, bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
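Worth noting: `dd` invoked this way mostly times buffered writes — the trailing `sync` flushes the page cache but is not included in dd's own reported rate. A rough flushed-write measurement sketch in Python (the path, file size, and function name are placeholders; the size is kept small so it's cheap to run anywhere, unlike the 25 GiB probe above):

```python
import os
import tempfile
import time

def write_throughput_mb_s(path: str, total_bytes: int, block: int = 25 * 2**20) -> float:
    """Time a sequential write, then fsync so the page cache cannot
    inflate the reported rate."""
    buf = b"\0" * block
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        written = 0
        while written < total_bytes:
            written += os.write(fd, buf[: total_bytes - written])
        os.fsync(fd)
    finally:
        os.close(fd)
    return (total_bytes / 2**20) / (time.monotonic() - start)

# 64 MiB in a temp dir; results will vary with hardware and caching.
with tempfile.TemporaryDirectory() as d:
    print(round(write_throughput_mb_s(os.path.join(d, "tempfile"), 64 * 2**20), 1))
```

For dd itself, `oflag=direct` or `conv=fdatasync` would similarly exclude cache effects from the reported number.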
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment. Thanks @bking ! It looks like the NVMe in this one is not a higher-speed one for writes, and I'm also wondering if perhaps its write performance has degraded with age. I'll paste in the results here; this was slower than the other servers, ironically (although not surprisingly, given the slower NVMe and slightly slower processor). This slower write speed is atypical of the other NVMes I've encountered. I believe the newer-model ones are rated for 6000 MB/s for writes. But I'm going to ping on task to see if we can get a comparative read of disk throughput from one of the newer and faster cp NVMes.

dr0ptp4kt@wdqs1025:/srv/deployment/wdqs/wdqs-cache$ ls /srv/wdqs/
aliases.map  dumps  wikidata.jnl  wikidump-1.ttl.gz  wikidump-2.ttl.gz  wikidump-3.ttl.gz  wikidump-4.ttl.gz  wikidump-5.ttl.gz  wikidump-6.ttl.gz  wikidump-7.ttl.gz  wikidump-8.ttl.gz  wikidump-9.ttl.gz
dr0ptp4kt@wdqs1025:/srv/deployment/wdqs/wdqs-cache$ cd cache
dr0ptp4kt@wdqs1025:/srv/deployment/wdqs/wdqs-cache/cache$ time ./loadData.sh -n wdq -d /srv/wdqs -s 1 -e 9
Processing wikidump-1.ttl.gz
totalElapsed=214282ms, elapsed=214279ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=233942ms, commitTime=1709910647417, mutationCount=22829952
Processing wikidump-2.ttl.gz
totalElapsed=196470ms, elapsed=196469ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=227786ms, commitTime=1709910874952, mutationCount=15807617
Processing wikidump-3.ttl.gz
totalElapsed=183111ms, elapsed=183110ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms
COMMIT: totalElapsed=213965ms, commitTime=1709911089170, mutationCount=12654001
Processing wikidump-4.ttl.gz
^C
real 14m4.855s
user 0m0.084s
sys 0m0.053s

dr0ptp4kt@wdqs1025:/srv/deployment/wdqs/wdqs-cache/cache$ cd /srv
dr0ptp4kt@wdqs1025:/srv$ df .
Filesystem      1K-blocks    Used  Available Use% Mounted on
/dev/nvme0n1   1537157352 9508448 1449491832   1% /srv
dr0ptp4kt@wdqs1025:/srv$ sudo sync; sudo dd if=/dev/zero of=tempfile bs=25M count=1024; sudo sync
1024+0 records in
1024+0 records out
26843545600 bytes (27 GB, 25 GiB) copied, 27.1995 s, 987 MB/s
dr0ptp4kt@wdqs1025:/srv$ sudo sync; sudo dd if=/dev/zero of=tempfile bs=25M count=1024; sudo sync
1024+0 records in
1024+0 records out
26843545600 bytes (27 GB, 25 GiB) copied, 37.5448 s, 715 MB/s
dr0ptp4kt@wdqs1025:/srv$ lsblk -o MODEL,SERIAL,SIZE,STATE --nodeps
MODEL                                 SERIAL          SIZE STATE
...
Dell Express Flash PM1725a 1.6TB SFF  S39XNX0JC01060  1.5T

TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: bking, dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt added a comment. First, adding some commands that were used for Blazegraph imports on Ubuntu 22.04. I had originally tried a good number of EC2 instance types, and then after that went back to focus on just four of them with a sequence of repeatable commands (this wasn't scripted, as I didn't want to spend time automating and also wanted to make sure I got the systems' feedback along the way). I forgot to grab RAM clock speed as a routine step when running these commands (I recall checking on one server maybe in the original checks, and did look at my Alienware), but generally the servers are DDR4 unless the documentation in AWS says DDR5 (for my 2018 Alienware and 2019 MacBook Pro they're DDR4, BTW).

# get the specs, get the software, ready the mount
lscpu
free -h
lsblk
sudo fdisk /dev/nvme1n1
  n
  p
  1
  ENTER
  ENTER
  w
lsblk
sudo mkfs.ext4 /dev/nvme1n1p1
mkdir rdf
sudo mount -t auto -v /dev/nvme1n1p1 /home/ubuntu/rdf
sudo chown ubuntu:ubuntu rdf
git clone https://gerrit.wikimedia.org/r/wikidata/query/rdf rdfdownload
cp -r rdfdownload/. rdf
cd rdf
df -h .
sudo apt update
sudo apt install openjdk-8-jdk-headless
./mvnw package -DskipTests

# ready Blazegraph and run a partial import
sudo mkdir /var/log/wdqs
sudo chown ubuntu:ubuntu /var/log/wdqs
touch /var/log/wdqs/wdqs-blazegraph.log
cd /home/ubuntu/rdf/dist/target/
tar xzvf service-0.3.138-SNAPSHOT-dist.tar.gz
cd service-0.3.138-SNAPSHOT/
# using logback.xml like prod:
mv ~/logback.xml .
# using runBlazegraph.sh like prod, 31g heap and pointer to logback.xml:
mv ~/runBlazegraph.sh .
vi runBlazegraph.sh
screen
./runBlazegraph.sh
CTRL-a-d to leave screen up
time ./loadData.sh -n wdq -d /home/ubuntu/ -s 1 -e 9
screen -r
CTRL-c to kill Blazegraph
exit from screen
ls -alh wikidata.jnl
rm wikidata.jnl

# try it with a ramdisk
sudo modprobe brd rd_size=50331648 max_part=1 rd_nr=1
sudo mkfs -t ext4 /dev/ram0
mkdir /home/ubuntu/rdfram
sudo mount /dev/ram0 /home/ubuntu/rdfram
sudo chown ubuntu:ubuntu /home/ubuntu/rdfram
cd
cp -r rdf/. rdfram
cd rdfram/dist/target/service-0.3.138-SNAPSHOT/
cp /home/ubuntu/wikidump-* /home/ubuntu/rdfram
df -h ./
screen
./runBlazegraph.sh
CTRL-a-d to leave screen up
time ./loadData.sh -n wdq -d /home/ubuntu/rdfram -s 1 -e 9

TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
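For reference, brd's `rd_size` parameter is in 1 KiB blocks, so the `modprobe` in the ramdisk step creates a 48 GiB block device:

```python
# brd's rd_size counts 1 KiB blocks; 50331648 KiB is exactly 48 GiB.
rd_size_kib = 50331648
ramdisk_gib = rd_size_kib * 1024 / 2**30
print(ramdisk_gib)  # → 48.0
```

That is comfortably larger than the journal produced by a 9-file partial import, which is what makes the ramdisk comparison feasible.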
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T358727: Reclaim recently-decommed CP host for WDQS (see T352253)
dr0ptp4kt added a comment. @VRiley-WMF any pointers on how to iDRAC / iLO to this node and establish with a hostname of `wdqs1025.eqiad.wmnet`? I'm wondering if maybe there's a direct IP or IPs given that there don't seem to be DNS records for `cp1086.eqiad.wmnet` or `cp1086.mgmt.eqiad.wmnet`? TASK DETAIL https://phabricator.wikimedia.org/T358727 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: VRiley-WMF, dr0ptp4kt Cc: Jclark-ctr, VRiley-WMF, ssingh, RKemper, dr0ptp4kt, wiki_willy, bking, Wunderlandmeli, Danny_Benjafield_WMDE, Astuthiodit_1, BTullis, karapayneWMDE, joanna_borun, Invadibot, Devnull, maantietaja, Muchiri124, ItamarWMDE, Akuckartz, Legado_Shulgin, ReaperDawn, Nandana, Davinaclare77, Techguru.pc, Lahi, Gq86, GoranSMilovanovic, Hfbn0, QZanden, EBjune, KimKelting, LawExplorer, Zppix, _jensen, rosalieper, Scott_WUaS, Wong128hk, Wikidata-bugs, aude, faidon, Mbch331, Jay8g, fgiunchedi ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt moved this task from Incoming to Current work on the Wikidata-Query-Service board. dr0ptp4kt removed a project: Wikidata-Query-Service. TASK DETAIL https://phabricator.wikimedia.org/T359062 WORKBOARD https://phabricator.wikimedia.org/project/board/891/ EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331, AWesterinen, Namenlos314, Lucas_Werkmeister_WMDE, merbst, Jonas, Xmlizer, jkroll, Jdouglas, Tobias1984, Manybubbles ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, KimKelting, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, KimKelting, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware
dr0ptp4kt changed the task status from "Open" to "In Progress". dr0ptp4kt triaged this task as "Medium" priority. dr0ptp4kt claimed this task. dr0ptp4kt added projects: Wikidata-Query-Service, Discovery-Search (Current work). dr0ptp4kt updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, Aklapper, AWesterinen, Namenlos314, Gq86, Lucas_Werkmeister_WMDE, EBjune, KimKelting, merbst, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T358727: Reclaim recently-decommed CP host for WDQS (see T352253)
dr0ptp4kt added a comment. Thanks @VRiley-WMF ! @bking is up next for imaging, I think. TASK DETAIL https://phabricator.wikimedia.org/T358727 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: VRiley-WMF, dr0ptp4kt Cc: Jclark-ctr, VRiley-WMF, ssingh, RKemper, dr0ptp4kt, wiki_willy, bking, Wunderlandmeli, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, BTullis, karapayneWMDE, joanna_borun, Invadibot, Devnull, maantietaja, Muchiri124, ItamarWMDE, Akuckartz, Legado_Shulgin, ReaperDawn, Nandana, Namenlos314, Davinaclare77, Techguru.pc, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, Hfbn0, QZanden, EBjune, KimKelting, merbst, LawExplorer, Zppix, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, Wong128hk, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, faidon, Mbch331, Jay8g, fgiunchedi ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T358727: Reclaim recently-decommed CP host for WDQS (see T352253)
dr0ptp4kt added a parent task: T358533: Hardware requests for Search Platform FY2024-2025. TASK DETAIL https://phabricator.wikimedia.org/T358727 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: Jclark-ctr, VRiley-WMF, ssingh, RKemper, dr0ptp4kt, wiki_willy, bking, Wunderlandmeli, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, BTullis, karapayneWMDE, joanna_borun, Invadibot, Devnull, maantietaja, Muchiri124, ItamarWMDE, Akuckartz, Legado_Shulgin, ReaperDawn, Nandana, Namenlos314, Davinaclare77, Techguru.pc, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, Hfbn0, QZanden, EBjune, KimKelting, merbst, LawExplorer, Zppix, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, Wong128hk, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, faidon, Mbch331, Jay8g, fgiunchedi ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T358727: Reclaim recently-decommed CP host for WDQS (see T352253)
dr0ptp4kt added a parent task: T336443: Investigate performance differences between wdqs2022 and older hosts. TASK DETAIL https://phabricator.wikimedia.org/T358727 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: Jclark-ctr, VRiley-WMF, ssingh, RKemper, dr0ptp4kt, wiki_willy, bking, Wunderlandmeli, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, BTullis, karapayneWMDE, joanna_borun, Invadibot, Devnull, maantietaja, Muchiri124, ItamarWMDE, Akuckartz, Legado_Shulgin, ReaperDawn, Nandana, Namenlos314, Davinaclare77, Techguru.pc, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, Hfbn0, QZanden, EBjune, KimKelting, merbst, LawExplorer, Zppix, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, Wong128hk, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, faidon, Mbch331, Jay8g, fgiunchedi ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs
dr0ptp4kt added a comment. I summarized at https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Graph_split_IGUANA_performance . When we have a mailing list post during the next week or so, we'll want to move this to be a subpage of the target page of the post. TASK DETAIL https://phabricator.wikimedia.org/T355037 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs
dr0ptp4kt added a comment. In T355037#9508760 <https://phabricator.wikimedia.org/T355037#9508760>, @dcausse wrote:

> @dr0ptp4kt thanks! is the difference in the number of successful queries only explained by the improvement in query time or are there some improvements in the number of queries that timeout as well?

Good question! It appears to be related to query time. Looking at this latest run, for example, there were no recorded timeouts according to the CSV of the IGUANA `.nt`. Taking things on a head-to-head basis for identical queries between the endpoints, here's what we see for the difference in speed for `wikidata_main_graph` minus `baseline`. It's unsurprising in a way given the distribution shown in the prior Phabricator comment, but it is another way of knowing that, under the parameters of this test anyway, about 70% of the queries noted as successful seemed to be faster when run against the `wikidata_main_graph`. Note that about 16% of the queries hit `wrongCodes` / `failed`, which are discussed after the table.

| Per-query wikidata_main_graph QPS minus baseline QPS | descriptor |
| - | - |
| 0.722596509877809 | average |
| 0.244672300065055 | median |
| 79.4339558877256 | 100% max (i.e., wikidata_main_graph's biggest winner) |
| 21.0654641024791 | 99% |
| 6.88080533343067 | 95% |
| 1.38414473312972 | 75% |
| 0.244672300065055 | 50% |
| 0.013982881368447 | 42% |
| 0 | 41% |
| 0 | 26% |
| -0.00701117502390231 | 25% |
| -0.215374628998983 | 20% |
| -0.598658931613195 | 15% |
| -1.41867399989265 | 10% |
| -4.16152316076897 | 5% |
| -18.0068429593504 | 1% |
| -80.2800161266253 | 0% min (i.e., baseline's biggest winner) |

About 58% of queries tilted toward `wikidata_main_graph`, and about 25% tilted toward `baseline`, and 58/(58+25) is about 0.7. The queries where the difference is negligible probably don't matter that much. Yet, there's a bit more detail to consider in IGUANA's conception here...
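The head-to-head table boils down to pairing per-query QPS by query id and looking at the distribution of differences. A toy sketch of that reduction (the four sample queries and their QPS values are invented; only the method mirrors the table above):

```python
from statistics import mean, median

# Invented per-query QPS values keyed by query id.
baseline = {"q1": 10.0, "q2": 3.0, "q3": 7.5, "q4": 0.2}
wikidata_main_graph = {"q1": 14.0, "q2": 2.5, "q3": 9.0, "q4": 0.2}

diffs = sorted(wikidata_main_graph[q] - baseline[q] for q in baseline)
faster_share = sum(d > 0 for d in diffs) / len(diffs)
print(mean(diffs), median(diffs))  # → 1.25 0.75
print(faster_share)                # → 0.5 (share faster on the main graph)
```

On the real data, `statistics.quantiles` over `diffs` would yield the percentile rows of the table.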
For the sake of completeness, and because this may be interesting to consider later on or to contextualize the QPS distributions in the prior Phabricator comment: looking at a different class of issues, let's suppose that we use `wrongCodes` as a proxy for things that could have gone wrong. `wrongCodes` and `failed` map to each other in the CSV, and their QPSes land as 0 for these records (`penalizedQPS`, not included in the tables above, lands by default as 0.017 for these records, but this is close enough to 0 if we wanted to look at it that way). These sorts of records thus drive down the summary mean, median, and so on. As an aside, in terms of actual time (`totalTime`), these `wrongCodes` ones occupy very little time.

| Endpoint Label | count wrongCodes | sum wrongCodes | count failed | sum failed | count timeout | count QPS < 1.0 | count QPS < 5.0 | count QPS < 20.0 | count QPS < 80.0 | count QPS < 200.0 |
| - | - | - | - | - | - | - | - | - | - | - |
| baseline |
[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs
dr0ptp4kt added a comment. Here's the output from the latest run based upon a larger set of queries from a random sample of WDQS queries.

$ /usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -cp iguana-3.3.3.jar org.aksw.iguana.rp.analysis.TabularTransform -e result.nt > result.execution.csv
$ cut -f1,3,5,6,7,9 -d"," result.execution.csv | sed 's/,/|/g'

| endpointLabel | taskStartDate | successfullQueries | successfullQueriesQPH | avgqps | queryMixesPH |
| - | - | - | - | - | - |
| baseline | 2024-01-31T23:20:44.567Z | 319857 | 136612.71246575614 | 18.83670491311007 | 1.732300885924224 |
| wikidata_main_graph | 2024-02-01T04:23:01.613Z | 331473 | 147674.12233239523 | 19.55930142298825 | 1.8725637484770261 |

Here's the screen capture from Grafana.

F41740308: Screenshot 2024-02-01 at 10.17.28 AM.png <https://phabricator.wikimedia.org/F41740308>

The `wikidata_main_graph` window completed more queries despite an apparent bout of increased failing queries (the climb began at about 0915 UTC), with a large garbage collection beginning about 5 minutes later (the GC started at about 0920 UTC, and actually continued well after the `wikidata_main_graph` window closed at 2024-02-01T09:23:55.639Z). This isn't the most interesting thing, as it only constitutes about 1.5%-3.0% of the `wikidata_main_graph` window depending on how one looks at it, and I wouldn't necessarily read anything into whether such GCs would be likely to occur under the same conditions, but I wanted to note it nonetheless.

To repeat the verbiage from the earlier runs...

> Following below are "per-query" summary stats. I actually just put this together by bringing CSV data into Google Sheets for now - all of the columns are calculated upon the "per-query" rows (but you'll see how the Mean corresponds basically with the value calculated up above). The underlying CSV data don't bear actual queries (the .nt files from which they're generated do), ...
The CSV data were generated with the following command:

`/usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -cp iguana-3.3.3.jar org.aksw.iguana.rp.analysis.TabularTransform -q result.nt > result.query.csv`

| Run | Endpoint Label | Mean | Median | Standard Deviation | Max (fastest) | 99% (very fast) | 0.95 | 0.75 | 0.5 | 0.25 | 1% (pretty slow) | Total w/ success |
| - | - | - | - | - | - | - | - | - | - | - | - | - |
| randomized 1 | baseline | 18.8367049131101 | 14.6999663404689 | 16.3589173757083 | 127.433177227691 | 59.009472115968 | 50.5734395961334 | 30.3470335487675 | 14.6999663404689 | 4.97164300568995 | 0 | 319857 |
| randomized 1 | wikidata_main_graph | 19.5593014229883 | 16.0982853987134 | 16.5098295290687 | 121.141149629509 | 58.9613256488317 | 51.0426872548935 | 31.751311031492 | 16.0982853987134 | 5.37249826361878 | 0 | 331473 |

Although the max and 99th-percentile queries were just ever so slightly faster on the baseline "full" graph, more generally things were faster on the non-scholarly "main" graph. The performance difference is obvious but not dramatic.

Here's the content of `wdqs-split-test-randomized-2024-01-31.yml`, comments removed for brevity. The main difference in this configuration file from the earlier-presented one is five hours allowed per graph, to accommodate a larger query mix, and the updated filename pointing to the larger query mix based on the set of queries from the random sample.
datasets:
  - name: "split"
connections:
  - name: "baseline"
    endpoint: "https://wdqs1022.eqiad.wmnet/sparql"
  - name: "wikidata_main_graph"
    endpoint: "https://wdqs1024.eqiad.wmnet/sparql"
tasks:
  - className: "org.aksw.iguana.cc.tasks.impl.Stresstest"
    configuration:
      timeLimit: 1800
      warmup:
        timeLimit: 3
        workers:
          - threads: 4
            className: "SPARQLWorker"
            queriesFile: "queries_for_performance_file_renamed_randomized_2024_01_31.txt"
            timeOut: 5000
      queryHandler:
        className: "DelimInstancesQueryHandler"
        configuration:
          delim: "### BENCH DELIMITER ###"
      workers:
        - threads: 4
          className: "SPARQLWorke
[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs
dr0ptp4kt added a comment. A run is in progress for 78K+ queries from a set of 100,000 random queries. It should be done in under 10 hours from now.

scala> val full_random = spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/full_random_classified.parquet")
scala> val wikidata_random = spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/wikidata_random_classified.parquet")
scala> full_random.count
res0: Long = 10
scala> wikidata_random.count
res6: Long = 10
scala> val joined11 = wikidata_random.as("w").join(full_random.as("f")).where("w.id = f.id and w.success = true and w.success = f.success and w.resultSize = f.resultSize and w.reorderedHash = f.reorderedHash").select(concat(col("w.query"), lit("\n### BENCH DELIMITER ###"))).distinct.sample(withReplacement=false, fraction=1.0, seed=42)
scala> joined11.count
res0: Long = 78862
scala> joined11.repartition(1).write.option("compression", "none").text("queries_for_performance_2024_01_31.txt")
scala> :quit

$ hdfs dfs -copyToLocal hdfs://analytics-hadoop/user/dr0ptp4kt/queries_for_performance_2024_01_31.txt/part-0-29c4e72d-800d-4148-b804-8e428ee71e9e-c000.txt ./queries_for_performance_file_renamed_randomized_2024_01_31.txt
$ bash start-iguana.sh wdqs-split-test-randomized-2024-01-31.yml

`start-iguana.sh` previously ran from `stat1006`, but this time around it's running from `stat1008` in order to use more RAM for the larger query mix.
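The Spark `where` clause above keeps only queries that succeeded on both endpoints with identical result size and reordered result hash. The same filter in miniature (plain Python; the records are invented stand-ins for the classified parquet rows):

```python
# Per-query classification records keyed by query id (invented data).
full = {
    "a": {"success": True, "resultSize": 5, "reorderedHash": 111},
    "b": {"success": True, "resultSize": 2, "reorderedHash": 222},
    "c": {"success": False, "resultSize": 0, "reorderedHash": 0},
}
wikidata = {
    "a": {"success": True, "resultSize": 5, "reorderedHash": 111},
    "b": {"success": True, "resultSize": 3, "reorderedHash": 333},  # result mismatch
    "c": {"success": True, "resultSize": 1, "reorderedHash": 444},  # failed on full
}

comparable = sorted(
    q for q, w in wikidata.items()
    if q in full
    and w["success"] and full[q]["success"]
    and w["resultSize"] == full[q]["resultSize"]
    and w["reorderedHash"] == full[q]["reorderedHash"]
)
print(comparable)  # → ['a']
```

Requiring identical, successfully computed results on both endpoints is what makes the subsequent timing comparison apples-to-apples.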
TASK DETAIL https://phabricator.wikimedia.org/T355037 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, KimKelting, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331 ___ Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs
dr0ptp4kt added a comment. Below are "per-query" summary stats. I put these together by bringing the CSV data into Google Sheets for now - all of the columns are calculated over the "per-query" rows (you'll see that the Mean matches the value calculated up above, with slightly less precision). The underlying CSV data don't contain the actual queries (the `.nt` files from which they're generated do), but rather rows of this form:

endpointLabel,task,queryId,totalTime,success,failed,timeouts,resultSize,unknownException,wrongCodes,qps,penalizedQPS
baseline,http://iguana-benchmark.eu/resource/1706221131/1/1,http://iguana-benchmark.eu/resource/1989023647/sparql0,53.592,2,0,0,1,0,0,37.319002836244216,37.319002836244216

No big surprises here. The "per-query" behavior was similar between nodes. The "main" graph skewed somewhat faster over the full range of queries, with one exception: the absolute fastest singular query for the "randomized 1" run was slightly faster on the "baseline" full graph; generally, everything else skewed faster for the "main" graph.
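For orientation on the columns, the sample row above suggests that `qps` is derived as successful executions divided by total elapsed time in seconds (an inference from this one row, not a statement about IGUANA internals):

```python
# Sanity-check the apparent relationship between CSV columns.
# Column meanings inferred from the sample row above; treat as an assumption.
def qps_from_row(success_count: int, total_time_ms: float) -> float:
    """Queries per second for a single query: successes / elapsed seconds."""
    return success_count / (total_time_ms / 1000.0)

row = {"totalTime": 53.592, "success": 2, "qps": 37.319002836244216}
derived = qps_from_row(row["success"], row["totalTime"])
print(derived)  # matches the recorded qps value to within floating-point noise
```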
**Per-query theoretical throughput (queries per second for given query)**

| Run | Endpoint Label | Mean | Median | Standard Deviation | Max (fastest) | 99% (very fast) | 0.95 | 0.75 | 0.5 | 0.25 | 1% (pretty slow) | Total w/ success |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| non-randomized 1 | baseline | 32.6059031135735 | 34.3489164537969 | 19.4337414434464 | 120.235661897318 | 76.8554161439261 | 55.7841155056668 | 49.1222469015747 | 34.3489164537969 | 14.3159887843845 | 0.0564663314012815 | 15538 |
| non-randomized 1 | wikidata_main_graph | 33.8619129716351 | 35.7193884840691 | 20.2056740376445 | 148.610491900728 | 81.2789060283893 | 57.5619081064873 | 50.4922999242615 | 35.7193884840691 | 15.098897780462 | 0.0625188498260728 | 16773 |
| non-randomized 2 | baseline | 32.9728451327437 | 34.7318699638788 | 19.5908672232246 | 128.890893858348 | 74.7142465127726 | 56.1419267909274 | 49.7404172035179 | 34.7318699638788 | 14.6689672498449 | 0.0569581938930891 | 15893 |
| non-randomized 2 | wikidata_main_graph | 34.0852093005914 | 36.106296938186 | 20.1931723422722 | 130.25921583952 | 82.0449565998977 | 57.6139754652666 | 50.625221485344 | 36.106296938186 | 15.378937007874 | 0.0622306059862422 | 16780 |
| randomized 1 | baseline | 32.8878633004489 | 34.6404323125952 | 19.8757923913207 | 136.072935093209 | 79.2782608462366 | 56.2164107372478 | 49.4926998267755 | 34.6404323125952 | 14.1755500113404 | 0.0557895216707227 | 15180 |
| randomized 1 | wikidata_main_graph | 33.9156003706814 | 35.7091844022282 | 20.2748013579631 | 132.082948091401 | 81.0501654381498 | 57.747392312958 | 50.5101525406606 | 35.7091844022282 | 15.079658294943 | 0.0574330048487202 | 15929 |
| randomized 2 | baseline | 33.007109052298 | 34.5670904661201 | 19.8511760316909 | 133.904659882163 | 81.4017649973028 | 56.1335754963953 | 49.6176341128876 | 34.5670904661201 | 14.3154187934342 | 0.0538090222917457 | 15211 |
| randomized 2 | wikidata_main_graph | 34.1402036271541 | 36.0958706323996 | 20.2577595936201 | 134.156157767641 | 83.0239310934627 | 57.7850982363029 | 50.5292946486809 | 36.0958706323996 | 15.7122834258313 | 0.0589775584599512 | 16084 |
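The kind of summary statistics shown above (mean, median, standard deviation, percentiles) can be reproduced from the per-query qps values with only the standard library; a minimal sketch using placeholder numbers, not the task's data:

```python
import statistics

def summarize(qps_values):
    """Mean, median, stdev, and selected percentiles of per-query QPS."""
    xs = sorted(qps_values)
    def pct(p):  # nearest-rank percentile on the sorted sample
        idx = min(len(xs) - 1, max(0, round(p * (len(xs) - 1))))
        return xs[idx]
    return {
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "stdev": statistics.stdev(xs),
        "p99": pct(0.99),
        "p01": pct(0.01),
        "max": xs[-1],
    }

print(summarize([10.0, 20.0, 30.0, 40.0]))
```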
dr0ptp4kt added a comment. Here are the data produced by IGUANA once piped through the CSV utility introduced in https://gitlab.wikimedia.org/repos/search-platform/IGUANA/-/merge_requests/3/diffs with a command of the following form (for the attentive reader: I had to rename the originally produced files to have an `.nt` extension so that the underlying Jena libraries wouldn't throw an exception).

`/usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -cp iguana-3.3.3.jar org.aksw.iguana.rp.analysis.TabularTransform -e result.003.nt > result.003.execution.csv`

| run | endpointLabel | taskStartDate | successfullQueries | successfullQueriesQPH | avgqps | queryMixesPH |
| --- | --- | --- | --- | --- | --- | --- |
| non-randomized 1 | baseline | 2024-01-25T22:18:57.753Z | 15538 | 17512.446990539123 | 32.60590311357346 | 0.9895715087607575 |
| non-randomized 1 | wikidata_main_graph | 2024-01-25T23:19:56.948Z | 16773 | 19125.484555828807 | 33.86191297163505 | 1.0807190233276154 |
| non-randomized 2 | baseline | 2024-01-26T01:47:41.634Z | 15893 | 17955.609618256018 | 32.97284513274341 | 1.0146131897076351 |
| non-randomized 2 | wikidata_main_graph | 2024-01-26T02:48:41.047Z | 16780 | 19145.810254441058 | 34.085209300591515 | 1.0818675625496446 |
| randomized 1 | baseline | 2024-01-26T16:51:54.091Z | 15180 | 17068.107622599186 | 32.88786330044905 | 0.9644633340452725 |
| randomized 1 | wikidata_main_graph | 2024-01-26T17:52:52.903Z | 15929 | 17969.809300477013 | 33.91560037068121 | 1.0154155676372838 |
| randomized 2 | baseline | 2024-01-26T19:37:30.811Z | 15211 | 17054.882354485933 | 33.00710905229813 | 0.9637160170924978 |
| randomized 2 | wikidata_main_graph | 2024-01-26T20:38:29.989Z | 16084 | 18210.142239149543 | 34.14020362715409 | 1.0289960015341326 |

Keep in mind that a delay between queries was introduced in the configuration for these "stress tests" (a "stress test" here means that the execution of the queries goes continuously for the specified time interval at its concurrency and delay
spec). This was to more closely model what a somewhat busy, but not completely saturated, WDQS node might experience, although we should be mindful that the server specs differ somewhat between these test servers and the WDQS hosts used for serving end-user WDQS production requests. When interpreting a value like `avgqps`, remember that it is akin to what might happen if queries were executed serially without delay, assuming it were possible to hold JVM performance constant for such request patterns (this generally cannot be guaranteed, so caveats abound; in other words, it's entirely possible that `avgqps` would degrade in reality). The `successfullQueriesQPH` metric is probably the most interesting one. It suggests about a 5%-10% speed advantage for the smaller "main" graph versus a fully populated "full" graph for this query mix when conditions model a somewhat busy WDQS node (again, the server spec differs a bit between the SUT (system under test) and the production nodes, so there is a caveat). Additional basic summary statistics upon the data from per-query CSV exports (using the `-q` flag) against the `.nt` files are to come. Note that in Andrea's previous analysis these sorts of statistics (as well as some tweaks to get somewhat finer precision via `BigDecimal` instead of `Double` types) were incorporated directly into the Java source of IGUANA - see https://github.com/dice-group/IGUANA/compare/main...AndreaWesterinen:IGUANA:main for changes up to June 13, 2022 against the current main branch of IGUANA; n.b., future readers may need to re-correlate the code changes as IGUANA upstream changes. But I opted to make fewer changes to our fork (i.e., I didn't merge Andrea's fork into our fork, even though there is some dependency similarity in the POMs), as this data can be determined in Spark summary stat calls.
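On the `avgqps` caveat: the reported figure matches the arithmetic mean of the per-query rates, which is optimistic as an aggregate; actual back-to-back throughput is governed by the harmonic mean, which slow queries dominate. A toy illustration with made-up rates, not the task's data:

```python
def avg_qps(per_query_qps):
    """Arithmetic mean of per-query QPS (optimistic aggregate)."""
    return sum(per_query_qps) / len(per_query_qps)

def serial_throughput(per_query_qps):
    """Aggregate rate if queries ran back-to-back: n / sum of per-query times.
    Equivalent to the harmonic mean of the per-query rates."""
    total_time = sum(1.0 / q for q in per_query_qps)
    return len(per_query_qps) / total_time

rates = [100.0, 1.0]             # one fast query, one slow query
print(avg_qps(rates))            # 50.5 -- dominated by the fast query
print(serial_throughput(rates))  # ~1.98 -- dominated by the slow query
```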
We may be interested in taking some of these enhancement opportunities forward to IGUANA upstream should we see the need for more IGUANA work later; then again, we may not, as our needs are narrower.
dr0ptp4kt added a comment. Now a screenshot from the re-run of the randomized order queries, followed by a screenshot showing the two runs on the randomized order queries side by side. F41722569: Screenshot 2024-01-27 at 6.36.58 AM.png <https://phabricator.wikimedia.org/F41722569> F41722573: Screenshot 2024-01-27 at 6.38.45 AM.png <https://phabricator.wikimedia.org/F41722573>
dr0ptp4kt added a comment. Now, the screenshot from the randomized order queries. I'll run one more time to see that comparable output is achieved. Those were produced with the following. This latest output file has been moved to `result.nt.003`.

scala> val joined6 = wikidata.as("w").join(full.as("f")).where("w.id = f.id and w.success = true and w.success = f.success and w.resultSize = f.resultSize and w.reorderedHash = f.reorderedHash").select(concat(col("w.query"), lit("\n### BENCH DELIMITER ###"))).distinct.sample(withReplacement=false, fraction=1.0, seed=42)
scala> joined6.count // matches joined5.count
scala> joined6.repartition(1).write.option("compression", "none").text("queries_for_performance_randomized_2024_01_26.txt")
scala> :quit
$ hdfs dfs -copyToLocal hdfs://analytics-hadoop/user/dr0ptp4kt/queries_for_performance_randomized_2024_01_26.txt/part-0-131df78f-da7a-4ffc-aad4-9874342165ca-c000.txt ./queries_for_performance_randomized.txt
$ sha1sum queries_for_performance.txt queries_for_performance_randomized.txt
$ # they're different
$ diff queries_for_performance.txt queries_for_performance_randomized.txt | wc -l
$ # they're very different
$ cp wdqs-split-test.yml wdqs-split-test-randomized.yml
$ # changed pointers to query file to be queries_for_performance_randomized.txt
$ bash start-iguana.sh wdqs-split-test-randomized.yml
$ mv result.nt result.nt.003
dr0ptp4kt added a comment. Now, a screenshot showing the re-run. And then a screenshot showing them side-by-side. This is just for the visual, and the data produced from IGUANA (what is in the `.nt` output that we can convert to a handy CSV) should be more telling. Next up, I'll randomize the order of the queries and do it again. F41720004: Screenshot 2024-01-26 at 10.19.36 AM.png <https://phabricator.wikimedia.org/F41720004> F41720006: Screenshot 2024-01-26 at 10.20.48 AM.png <https://phabricator.wikimedia.org/F41720006>
dr0ptp4kt added a comment. Dropping in a screenshot from Grafana from this first pass; I've also made a copy of `result.nt` to `result.nt.001`. Re-running to see that server behavior is similar. F41718197: Screenshot 2024-01-25 at 7.43.14 PM.png <https://phabricator.wikimedia.org/F41718197>
dr0ptp4kt added a comment. For the first pass, the following configuration is being used for an hour-long test conducted from `stat1006`, with config file `wdqs-split-test.yml` as follows.

datasets:
  - name: "split"
connections:
  - name: "baseline"
    endpoint: "https://wdqs1022.eqiad.wmnet/sparql"
  - name: "wikidata_main_graph"
    endpoint: "https://wdqs1024.eqiad.wmnet/sparql"
tasks:
  - className: "org.aksw.iguana.cc.tasks.impl.Stresstest"
    configuration:
      timeLimit: 360
      warmup:
        timeLimit: 3
        workers:
          - threads: 4
            className: "SPARQLWorker"
            queriesFile: "queries_for_performance.txt"
            timeOut: 5000
      queryHandler:
        className: "DelimInstancesQueryHandler"
        configuration:
          delim: "### BENCH DELIMITER ###"
      workers:
        - threads: 4
          className: "SPARQLWorker"
          queriesFile: "queries_for_performance.txt"
          timeOut: 6
          parameterName: "query"
          gaussianLatency: 100
metrics:
  - className: "QMPH"
  - className: "QPS"
  - className: "NoQPH"
  - className: "AvgQPS"
  - className: "NoQ"
storages:
  - className: "NTFileStorage"
    configuration:
      fileName: result.nt

`queries_for_performance.txt` is based on the following basic code, which selects queries known to work against both the full graph and the main (non-scholarly) graph and to return similar results, so as to reduce garbage input and somewhat better control the parameters of the test.
scala> val wikidata = spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/wikidata_classified.parquet")
scala> val full = spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/full_classified.parquet")
scala> val joined5 = wikidata.as("w").join(full.as("f")).where("w.id = f.id and w.success = true and w.success = f.success and w.resultSize = f.resultSize and w.reorderedHash = f.reorderedHash").select(concat(col("w.query"), lit("\n### BENCH DELIMITER ###"))).distinct
scala> joined5.repartition(1).write.option("compression", "none").text("queries_for_performance_2024_01_25.txt")
scala> :quit
$ hdfs dfs -copyToLocal hdfs://analytics-hadoop/user/dr0ptp4kt/queries_for_performance_2024_01_25.txt/part-0-6b8caed3-3a4d-4cb2-bf74-6bbcd7af0478-c000.txt ./queries_for_performance.txt
$ /usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -jar iguana-3.3.3.jar wdqs-split-test.yml

The IGUANA build is based on https://gitlab.wikimedia.org/repos/search-platform/IGUANA/-/merge_requests/4 .
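The join predicate above amounts to a per-query equivalence check between the two endpoints. Restated as a standalone Python function over plain dicts for clarity (an illustration of the logic, not the Spark code):

```python
# Keep a query only if it succeeded on both endpoints and returned the
# same result size and the same order-insensitive result hash.
def equivalent(w: dict, f: dict) -> bool:
    return (
        w["id"] == f["id"]
        and w["success"] is True
        and w["success"] == f["success"]
        and w["resultSize"] == f["resultSize"]
        and w["reorderedHash"] == f["reorderedHash"]
    )

a = {"id": 1, "success": True, "resultSize": 5, "reorderedHash": "abc"}
b = {"id": 1, "success": True, "resultSize": 5, "reorderedHash": "abc"}
c = {"id": 1, "success": True, "resultSize": 4, "reorderedHash": "xyz"}
print(equivalent(a, b), equivalent(a, c))  # True False
```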
dr0ptp4kt claimed this task.
dr0ptp4kt updated the task description.
[Wikidata-bugs] [Maniphest] T350106: Implement a spark job that converts a RDF triples table into a RDF file format
dr0ptp4kt added a comment. Imports seemed to work. **Non-scholarly article side (proxied to wdqs1024.eqiad.wmnet)** F41650681: split-non-schol-side.gif <https://phabricator.wikimedia.org/F41650681> **Scholarly article side (proxied to wdqs1023.eqiad.wmnet)** F41650680: split-schol-side.gif <https://phabricator.wikimedia.org/F41650680> Next steps: - Add automated unit test(s) to the patch. - Add doc / pointer to Pastes somewhere handy. Also, non-blocking for this here task, but mentioning here for findability - the queries in T349512: [Analytics] Collect multiple sets of SPARQL queries <https://phabricator.wikimedia.org/T349512> will provide the fuller view on query coverage and their runtime characteristics.
dr0ptp4kt added a comment. After an update to the script (PS6) and a fresh run of the same commands, new files have been `hdfs-rsync`'d to `stat1006:~dr0ptp4kt/gzips` in anticipation of doing a file transfer over to the WDQS graph split test servers. Here's a very small sample of what the files look like:

$ zcat part-01022-c261bb68-4091-4613-ae52-88ce97d22c14-c000.txt.gz | tail -10
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/description> "\u0935\u093F\u0915\u093F\u092E\u093F\u0921\u093F\u092F\u093E \u0936\u094D\u0930\u0947\u0923\u0940"@ne .
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/description> "\u043A\u0430\u0442\u0435\u0433\u043E\u0440\u0438\u0458\u0430 \u043D\u0430 \u0412\u0438\u043A\u0438\u043C\u0435\u0434\u0438\u0458\u0438"@sr .
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/description> "\u7DAD\u57FA\u5A92\u9AD4\u5206\u985E"@yue .
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/description> "Wikimedia-Kategorie"@de-ch .
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/description> "catigur\u00ECa di nu pruggettu Wikimedia"@scn .
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/description> "categoria di un progetto Wikimedia"@it .
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/version> "1979010859"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/description> "kategori Wikimedia"@map-bms .
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/description> "Wikimedia-kategoriija"@se .
<http://www.wikidata.org/entity/Q99896811> <http://schema.org/description> "\u7DAD\u57FA\u5A92\u9AD4\u5206\u985E"@zh-mo .
$ zcat part-01023-c261bb68-4091-4613-ae52-88ce97d22c14-c000.txt.gz | head -10
<http://www.wikidata.org/entity/statement/Q99896811-7623BB4C-2D20-4D2E-8784-E2ED8AD3E8E5> <http://wikiba.se/ontology#rank> <http://wikiba.se/ontology#NormalRank> .
<http://www.wikidata.org/entity/statement/Q99896811-7623BB4C-2D20-4D2E-8784-E2ED8AD3E8E5> <http://www.wikidata.org/prop/statement/P31> <http://www.wikidata.org/entity/Q4167836> .
<http://www.wikidata.org/entity/statement/Q99896811-7623BB4C-2D20-4D2E-8784-E2ED8AD3E8E5> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://wikiba.se/ontology#BestRank> .
<https://ar.wikipedia.org/wiki/%D8%AA%D8%B5%D9%86%D9%8A%D9%81:%D8%B4%D8%B1%D9%83%D8%A7%D8%AA_%D8%B3%D9%88%D9%8A%D8%B3%D8%B1%D9%8A%D8%A9_%D8%A3%D8%B3%D8%B3%D8%AA_%D9%81%D9%8A_1973> <http://schema.org/about> <http://www.wikidata.org/entity/Q99896811> .
<https://ar.wikipedia.org/wiki/%D8%AA%D8%B5%D9%86%D9%8A%D9%81:%D8%B4%D8%B1%D9%83%D8%A7%D8%AA_%D8%B3%D9%88%D9%8A%D8%B3%D8%B1%D9%8A%D8%A9_%D8%A3%D8%B3%D8%B3%D8%AA_%D9%81%D9%8A_1973> <http://schema.org/isPartOf> <https://ar.wikipedia.org/> .
<https://ar.wikipedia.org/wiki/%D8%AA%D8%B5%D9%86%D9%8A%D9%81:%D8%B4%D8%B1%D9%83%D8%A7%D8%AA_%D8%B3%D9%88%D9%8A%D8%B3%D8%B1%D9%8A%D8%A9_%D8%A3%D8%B3%D8%B3%D8%AA_%D9%81%D9%8A_1973> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Article> .
<https://ar.wikipedia.org/wiki/%D8%AA%D8%B5%D9%86%D9%8A%D9%81:%D8%B4%D8%B1%D9%83%D8%A7%D8%AA_%D8%B3%D9%88%D9%8A%D8%B3%D8%B1%D9%8A%D8%A9_%D8%A3%D8%B3%D8%B3%D8%AA_%D9%81%D9%8A_1973> <http://schema.org/inLanguage> "ar" .
<https://ar.wikipedia.org/wiki/%D8%AA%D8%B5%D9%86%D9%8A%D9%81:%D8%B4%D8%B1%D9%83%D8%A7%D8%AA_%D8%B3%D9%88%D9%8A%D8%B3%D8%B1%D9%8A%D8%A9_%D8%A3%D8%B3%D8%B3%D8%AA_%D9%81%D9%8A_1973> <http://schema.org/name> "\u062A\u0635\u0646\u064A\u0641:\u0634\u0631\u0643\u0627\u062A \u0633\u0648\u064A\u0633\u0631\u064A\u0629 \u0623\u0633\u0633\u062A \u0641\u064A 1973"@ar .
<https://en.wikipedia.org/wiki/Category:Swiss_companies_established_in_1973> <http://schema.org/inLanguage> "en" .
<https://en.wikipedia.org/wiki/Category:Swiss_companies_established_in_1973> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Article> .
You'll notice that the files are partitioned by `context` and `subject`, and within a partition they're also sorted by `context` and `subject` (the `context` field isn't part of the output, though; one would get that from the source tables). So you may see, as in this example, things that are logically clustered together spanning from the end of one file to the beginning of the next partition in sequence.
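The cluster-straddling-a-file-boundary behavior described above can be pictured with a toy sketch: sort globally, then cut into fixed-size partitions, and a run of records sharing a key can land on both sides of a cut (illustrative Python, not the Spark job itself):

```python
def partition_sorted(records, partition_size):
    """Globally sort records, then slice into fixed-size partitions."""
    xs = sorted(records)
    return [xs[i:i + partition_size] for i in range(0, len(xs), partition_size)]

# Toy records keyed by subject; Q2's cluster straddles the partition boundary.
records = ["Q1/a", "Q1/b", "Q2/a", "Q2/b", "Q2/c", "Q3/a"]
parts = partition_sorted(records, 4)
print(parts)  # Q2 records appear at the end of part 0 and the start of part 1
```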
dr0ptp4kt added a subscriber: RKemper. dr0ptp4kt added a comment. I ran the current version of the code as follows:

spark3-submit --master yarn --driver-memory 16G --executor-memory 12G --executor-cores 4 \
  --conf spark.driver.cores=2 --conf spark.executor.memoryOverhead=4g \
  --conf spark.sql.shuffle.partitions=512 --conf spark.dynamicAllocation.maxExecutors=128 \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 --conf spark.yarn.maxAppAttempts=1 \
  --class org.wikidata.query.rdf.spark.transform.structureddata.dumps.NTripleGenerator \
  --name wikibase-rdf-statements-spark \
  ~dr0ptp4kt/rdf-spark-tools-0.3.138-SNAPSHOT-jar-with-dependencies.jar \
  --input-table-partition-spec discovery.wikibase_rdf_scholarly_split/snapshot=20231016/wiki=wikidata/scope=wikidata_main \
  --output-hdfs-path hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main --num-partitions 1024

spark3-submit --master yarn --driver-memory 16G --executor-memory 12G --executor-cores 4 \
  --conf spark.driver.cores=2 --conf spark.executor.memoryOverhead=4g \
  --conf spark.sql.shuffle.partitions=512 --conf spark.dynamicAllocation.maxExecutors=128 \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 --conf spark.yarn.maxAppAttempts=1 \
  --class org.wikidata.query.rdf.spark.transform.structureddata.dumps.NTripleGenerator \
  --name wikibase-rdf-statements-spark \
  ~dr0ptp4kt/rdf-spark-tools-0.3.138-SNAPSHOT-jar-with-dependencies.jar \
  --input-table-partition-spec discovery.wikibase_rdf_scholarly_split/snapshot=20231016/wiki=wikidata/scope=scholarly_articles \
  --output-hdfs-path hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_schol --num-partitions 1024

And updated the permissions.
hdfs dfs -chgrp -R analytics-search-users hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main
hdfs dfs -chgrp -R analytics-search-users hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_schol

From stat1006 it is possible to use the already present `hdfs-rsync` (a script fronting a Java utility) to copy the produced files, like this:

hdfs-rsync -r hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_schol/ file:/destination/to/nt_wd_schol_gzips/
hdfs-rsync -r hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main/ file:/destination/to/nt_wd_main_gzips/

Note: each directory has 1,024 files of 100 MB +/- a certain number of MB. The Spark routine randomly samples the data before sorting into partitions, and although all partitions have data, there's mild skew, so the files aren't all exactly the same number of records. @bking, @RKemper, @dcausse, and I will discuss more this week.
dr0ptp4kt added a comment. Not using right now, but here's roughly how one might go about generating more expanded Turtle statements without reverse-mapping prefixes: F41561068 <https://phabricator.wikimedia.org/F41561068>
dr0ptp4kt added a subscriber: EBernhardson. dr0ptp4kt added a comment. Adding a note so I don't forget: advice from @BTullis is to avoid NFS if possible, and advice from @JAllemandou is to consider use of `hdfs-rsync` (after our call I sought this out and found these: https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/python/refinery/hdfs.py and https://gerrit.wikimedia.org/g/analytics/hdfs-tools/deploy/+/2445aec92f6b3d409531fb74ab3f9a22d9716823/bin/hdfs-rsync and https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/bin/hdfs-rsync ). Chances are we'd need to add a ferm rule and possibly wire up some Kerberos configuration on the WDQS servers if going the hdfs-rsync route. During a Meet today, @EBernhardson, the group, and I discussed possible use of a mechanism similar to https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/search/shared/transfer_to_es.py?ref_type=heads#L74-83 and https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/blob/main/mjolnir/kafka/bulk_daemon.py?ref_type=heads where a file is moved to Swift via Airflow and Mjolnir client code listens for Kafka events carrying the URLs from which to fetch the produced files (I haven't read this code closely yet, just parroting what I think I heard). We'll likely need to do these data transfers more than once, so it'll be good to have some level of automation support.
dr0ptp4kt claimed this task.
[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules
dr0ptp4kt added a comment.

The job completed. The counts on this productionized job match those from the prior run in my namespace. Following are some Hive queries in case they're needed later. Below that is a very small sample of the resultant data in tabular format for each partition.

**Counts**

select count(1) from discovery.wikibase_rdf_scholarly_split where snapshot = '20231016' and wiki = 'wikidata' and scope = 'scholarly_articles';
7643858365

select count(1) from discovery.wikibase_rdf_scholarly_split where snapshot = '20231016' and wiki = 'wikidata' and scope = 'wikidata_main';
7677112695

**Samples**

Note: because the target sample size is so small, it's possible to get slightly fewer than the target number of records due to sparseness in a randomly selected set. One can compensate by raising the numerator or lowering the denominator to reduce the chance of such artifacts (e.g., to avoid getting 27 records when one really wants 30; below we do get 30 records apiece, mind you). Note the horizontal scrollbars at the bottom of the tabular data in case the tables overflow in one's browser in Phabricator (mine do).
select "| " || concat_ws(" | ", subject, predicate, object, context) from discovery.wikibase_rdf_scholarly_split where snapshot = '20231016' and wiki = 'wikidata' and scope = 'scholarly_articles' and rand() <= (30/7643858365) distribute by rand() sort by rand() limit 30;

| subject | predicate | object | context |
| --- | --- | --- | --- |
| http://www.wikidata.org/entity/statement/Q114851466-BB650063-6818-4AF5-88FD-743A5520811C | http://www.w3.org/ns/prov#wasDerivedFrom | http://www.wikidata.org/reference/a84e44b8b704dd021b87b792549c1623fc1edff3 | http://www.wikidata.org/entity/Q114851466 |
| http://www.wikidata.org/entity/statement/Q73327727-EE2DF999-D668-4D6A-860F-B5FE8B93747E | http://wikiba.se/ontology#rank | http://wikiba.se/ontology#NormalRank | http://www.wikidata.org/entity/Q73327727 |
| http://www.wikidata.org/entity/Q45987415 | http://www.wikidata.org/prop/direct/P407 | http://www.wikidata.org/entity/Q1860 | http://www.wikidata.org/entity/Q45987415 |
| http://www.wikidata.org/entity/statement/Q44327803-9B2ED327-7B22-41B3-927D-F0D780F14C63 | http://wikiba.se/ontology#rank | http://wikiba.se/ontology#NormalRank | http://www.wikidata.org/entity/Q44327803 |
| http://www.wikidata.org/entity/Q40775359 | http://schema.org/description | "\u043D\u0430\u0443\u0447\u043D\u0430\u044F \u0441\u0442\u0430\u0442\u044C\u044F"@ru | http://www.wikidata.org/entity/Q40775359 |
| http://www.wikidata.org/entity/statement/Q33904556-172A1324-DF02-4555-AC23-CD26DED1A182 | http://www.wikidata.org/prop/statement/P304 | "49-52" | http://www.wikidata.org/entity/Q33904556 |
| http://www.wikidata.org/entity/Q21994578 | http://schema.org/description | "wetenschappelijk artikel (gepubliceerd op 2009/10/09)"@nl | http://www.wikidata.org/entity/Q21994578 |
| http://www.wikidata.org/entity/statement/Q93701619-747BA9CD-B887-4755-A744-01607FD15567 | http://wikiba.se/ontology#rank | http://wikiba.se/ontology#NormalRank | http://www.wikidata.org/entity/Q93701619 |
| http://www.wikidata.org/entity/statement/Q42812060-1DBF45B2-E920-4CF6-8011-A94820FF10EA | http://wikiba.se/ontology#rank | http://wikiba.se/ontology#NormalRank | http://www.wikidata.org/entity/Q42812060 |
| http://www.wikidata.org/entity/statement/Q36819529-349D4DA8-BC3D-4B01-90F4-C5D42F4E3683 | http://www.wikidata.org/prop/statement/P50 | http://www.wikidata.org/entity/Q58034888 |
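As an aside on the sampling caveat above: that `rand() <= (30/N)` can come up short of 30 rows, and that inflating the acceptance probability compensates, can be simulated in miniature. This is pure Python with a made-up row count for speed, not the real 7.6B-row table.

```python
import random

def bernoulli_sample(n_rows, target, inflate=1.0, seed=0):
    """Simulate `where rand() <= target/n_rows ... limit target`:
    each row passes independently, then results are truncated to
    `target`. `inflate` raises the acceptance probability, mirroring
    the raise-the-numerator tweak described above."""
    rng = random.Random(seed)
    p = min(1.0, inflate * target / n_rows)
    kept = sum(1 for _ in range(n_rows) if rng.random() <= p)
    return min(kept, target)

# With p = 30/10_000 over 10_000 rows the expected pass count is exactly
# 30, so roughly half of all runs fall short of 30 rows.
short_runs = sum(1 for s in range(200)
                 if bernoulli_sample(10_000, 30, seed=s) < 30)
# Doubling the acceptance probability makes a shortfall very unlikely.
safe_runs = sum(1 for s in range(200)
                if bernoulli_sample(10_000, 30, inflate=2.0, seed=s) < 30)
```

With the expected pass count sitting exactly at the target, a shortfall happens about half the time; any inflation factor comfortably above 1 pushes that probability toward zero.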
[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules
dr0ptp4kt closed this task as "Resolved". dr0ptp4kt triaged this task as "High" priority. TASK DETAIL https://phabricator.wikimedia.org/T347989
[Wikidata-bugs] [Maniphest] T337013: [Epic] Splitting the graph in WDQS
dr0ptp4kt closed subtask T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules as Resolved. TASK DETAIL https://phabricator.wikimedia.org/T337013
[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules
dr0ptp4kt added a subscriber: EBernhardson. dr0ptp4kt added a comment.

Spark patch merged, new Jenkins build of the rdf JAR done, Airflow patch merged. This is deployed to Search's Airflow instance and the job is running. Thank you, @dcausse and @EBernhardson. Here's the configuration for the currently running job:

--deploy-mode cluster
hdfs:///wmf/cache/artifacts/airflow/search/rdf-spark-tools-0.3.137-jar-with-dependencies.jar
--input-table-partition-spec discovery.wikibase_rdf_t337013/date=20231016/wiki=wikidata
--output-table-partition-spec discovery.wikibase_rdf_scholarly_split/snapshot=20231016/wiki=wikidata
max_attempts: 1

TASK DETAIL https://phabricator.wikimedia.org/T347989
[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules
dr0ptp4kt moved this task from In Progress to Needs review on the Discovery-Search (Current work) board. dr0ptp4kt added a comment.

Here's what I saw after re-running. We should be good with the latest patchset, which goes without distinct() on the final graphs.

Without distinct() on final graphs - 1h48m [dr0ptp4kt.wikibase_rdf_scholarly_split_refactor_no_distinct_less_cache]
scholarly_articles: 7_643_858_365, wikidata_main: 7_677_112_695

With distinct() on final graphs - 1h55m [dr0ptp4kt.wikibase_rdf_scholarly_split_refactor_using_distinct_less_cache]
scholarly_articles: 7_643_858_365, wikidata_main: 7_677_112_695

TASK DETAIL https://phabricator.wikimedia.org/T347989 WORKBOARD https://phabricator.wikimedia.org/project/board/1227/
[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules
dr0ptp4kt added a comment. Update: it seems to be working. Thus, I'd say this is maybe 75% complete. It takes about 1h40m to run and generate the two partitions. WIP/draft patches are posted at https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/969229 and ^ . They require some refactoring and the introduction of tests, and probably some extra config variables - I'll connect with Joseph about that last part. David, Erik, and I talked things through earlier today while I had the repos open in my IDE. I'll request code review so I can iterate on this. TASK DETAIL https://phabricator.wikimedia.org/T347989
[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation
dr0ptp4kt added a comment. I also see https://grafana.wikimedia.org/d/00264/wikidata-dump-downloads?orgId=1=5m=now-2y=now , which I noticed from some tickets involving @Addshore (cf. T280678: Crunch and delete many old dumps logs <https://phabricator.wikimedia.org/T280678> and friends) and a pointer from a colleague. As I noted, there are some complications around the 200s, and I see from T280678 <https://phabricator.wikimedia.org/T280678>'s pointer to the source processing at https://github.com/wikimedia/analytics-wmde-scripts/blob/master/src/wikidata/dumpDownloads.php#L12 that it considers both 206s and 200s. Future TODO in case we want to figure out how to deal with the different-sized 200s and the apparent downloader utilities. TASK DETAIL https://phabricator.wikimedia.org/T347605
[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation
dr0ptp4kt added a comment. ^ Update. TASK DETAIL https://phabricator.wikimedia.org/T347605
[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation
dr0ptp4kt added a comment. Looking at yesterday's downloads with a rudimentary grep, we're not far from 1K downloads, and that's just for the //latest-all// ones. That also doesn't consider mirrors.

stat1007:~$ zgrep wikidatawiki /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20231024.gz | grep latest-all | grep " 200 " | wc -l

It's good to keep in mind that some of these downloads are mirror jobs themselves, but looking at some of the source IPs it's clear that a good number of them are not mirrors. TASK DETAIL https://phabricator.wikimedia.org/T347605
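The grep pipeline above can be mirrored in plain Python for quick checks against a decompressed log. The sample lines below are fabricated for illustration, not real requests.

```python
def count_dump_downloads(lines):
    """Equivalent of:
      zgrep wikidatawiki access.log | grep latest-all | grep " 200 " | wc -l
    i.e. count completed (HTTP 200) downloads of wikidatawiki latest-all
    dumps. Note this intentionally excludes 206 partial responses, the
    complication discussed elsewhere in this thread."""
    return sum(
        1 for line in lines
        if "wikidatawiki" in line and "latest-all" in line and " 200 " in line
    )

# Fabricated access-log lines for demonstration.
sample = [
    '1.2.3.4 - - "GET /wikidatawiki/entities/latest-all.ttl.gz" 200 129294028486',
    '5.6.7.8 - - "GET /wikidatawiki/entities/latest-all.ttl.gz" 206 1048576',
    '9.9.9.9 - - "GET /enwiki/latest-pages.xml.bz2" 200 123',
]
n = count_dump_downloads(sample)
```

Only the first fabricated line matches all three filters: the second is a 206 range request, and the third is a different wiki's dump.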
[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules
dr0ptp4kt added a comment. It took about 26min 24s to write `S_direct_triples` (7_293_925_470 rows) in basic Parquet. That's not all the rows (not even for its own partition, as that will include Value and Reference triples as well), but it means it ought to be possible for the job to write a total of 15B rows in about an hour of wall time (maybe double that to play it safe). TASK DETAIL https://phabricator.wikimedia.org/T347989
[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules
dr0ptp4kt added a comment.

TL;DR: this is about 45% done.

This week I was working to address non-performant, often hanging or crashing, Spark runs. Last night I managed to get this running better, producing a reduction (the equivalent of `val_triples_only_used_by_sas` from https://people.wikimedia.org/~andrewtavis-wmde/T342111_spark_sa_subgraph_metrics.html ) in 8 minutes in one pass - instead of 3 hours or, worse, something longer followed by an indefinite hang or crash. The key here was a couple of things.

First, higher resource limits (this seems obvious, but isn't always true) and attempting to prevent Spark from doing broadcast joins (based on the Spark web UI's DAGs it still tries to do them, but at least doesn't seem to do them at bad times):

"spark.driver.memory": "16g",
"spark.driver.cores": 2,
"spark.executor.memory": "12g",
"spark.executor.cores": 4,
"spark.executor.memoryOverhead": "4g",
"spark.sql.shuffle.partitions": 512,
'spark.dynamicAllocation.maxExecutors': 128,
'spark.locality.wait': '1s', # test 0
'spark.sql.autoBroadcastJoinThreshold': -1

Second, removal of `cache()` calls and setting some join tables up as their own DataFrames. In practice this likely means more disk-based merge behavior on the executors for huge joins, but it works better. I'm interested to explore bucketing as an optimization strategy, but may forgo it for production of the table as it doesn't seem necessary at the moment - it may, however, be useful for the produced table for people doing further join operations, so I'm thinking about that.

I had the small reduction pushing to a Parquet directory in HDFS last night. I will be working to see how performant and reliable pushing a larger data set is and will report back here. From there I'll port from Python to Scala.
TASK DETAIL https://phabricator.wikimedia.org/T347989
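The settings quoted in the comment above can be collected into a single conf mapping. This is a sketch assuming PySpark's builder-style `config(key, value)` chaining; a stand-in builder object is used here so the snippet runs without a Spark installation.

```python
# Conf values copied from the comment above. Disabling
# spark.sql.autoBroadcastJoinThreshold (-1) stops Spark from
# automatically broadcasting "small" join sides, which was implicated
# in the hangs described.
SPARK_CONF = {
    "spark.driver.memory": "16g",
    "spark.driver.cores": 2,
    "spark.executor.memory": "12g",
    "spark.executor.cores": 4,
    "spark.executor.memoryOverhead": "4g",
    "spark.sql.shuffle.partitions": 512,
    "spark.dynamicAllocation.maxExecutors": 128,
    "spark.locality.wait": "1s",
    "spark.sql.autoBroadcastJoinThreshold": -1,
}

def apply_conf(builder, conf=SPARK_CONF):
    """Fold the conf dict onto a SparkSession.Builder-like object via
    chained .config(key, value) calls."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder

class _FakeBuilder:
    """Stand-in for SparkSession.builder so the sketch runs anywhere."""
    def __init__(self):
        self.applied = {}
    def config(self, key, value):
        self.applied[key] = value
        return self

applied = apply_conf(_FakeBuilder()).applied
```

With real PySpark the same `apply_conf` would be called on `SparkSession.builder` before `.getOrCreate()`.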
[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation
dr0ptp4kt added a comment. Good question - I meant the contrast between the .ttl.gz dumps plus everything that goes into munging and importing (in aggregate across all downloaders of those files) versus the same for the .jnl, where downloaders don't have to munge and import. Napkin-mathsing it, the thought was that the energy savings accrue about as soon as the 16 cores x 12 hours of compression time on the .jnl has been "saved" by people in aggregate not needing to run the import process (and I'm waving away the client-side decompression, which technically happens twice for the .ttl.gz user but only once for the .jnl.zst user, plus any other disk or network transfer pieces, as those are all close enough, I suppose). I'll go check on what stats may be readily available on dump downloads. Good point on having a checksum and timestamp. Yeah, it would be nice to have it in an on-demand place without the need for extra data transfer! TASK DETAIL https://phabricator.wikimedia.org/T347605
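The napkin math above can be made explicit. The 16 cores x 12 hours of compression time comes from the comment; the per-user import cost below is a placeholder assumption for illustration, not a measurement.

```python
# One-time server-side cost to compress the .jnl, from the comment.
COMPRESS_CORE_HOURS = 16 * 12            # = 192 core-hours

# Assumed aggregate munge+import cost a downloader avoids by using the
# .jnl directly. This figure is a hypothetical stand-in.
IMPORT_CORE_HOURS_PER_USER = 8 * 24      # = 192 core-hours

def users_to_break_even(compress=COMPRESS_CORE_HOURS,
                        per_user=IMPORT_CORE_HOURS_PER_USER):
    """Downloaders who must skip the ETL before the one-time
    compression cost is repaid in aggregate core-hours."""
    return -(-compress // per_user)      # ceiling division

n = users_to_break_even()
```

Under these placeholder numbers the break-even point is the very first user who skips the import, which matches the intuition in the comment that the savings accrue almost immediately.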
[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules
dr0ptp4kt added a comment. Personalized dev environment on the analytics cluster with Airflow setup (stat1006): I was able to execute the job - slightly hacked up to target a specific date and not keep running regularly (which eats lots of disk) - to produce `dr0ptp4kt.wikibase_rdf_with_split` using my Kerberos principal. Verified the Jupyter notebook approach from David / Andy on stat1005 - some glitches, as to be expected, but it worked okay after doubling timeouts and removing some caps. Next up: working on a job that will do the splitting in a fashion similar to what's achieved with the join-antijoin approach of the notebooks. I'll want the produced data separated out from the existing table, I think - in this case it would be okay in my opinion to use some extra disk. TASK DETAIL https://phabricator.wikimedia.org/T347989
[Wikidata-bugs] [Maniphest] T344905: Publish WDQS JNL files to dumps.wikimedia.org
dr0ptp4kt added a comment. > I think the amount of time taken to decompress the JNL file should also be taken into consideration on varying hardware if compression is being considered. Closing the loop: posted my experience at T347605#9229608 <https://phabricator.wikimedia.org/T347605#9229608>. TASK DETAIL https://phabricator.wikimedia.org/T344905
[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation
dr0ptp4kt added a comment. @bking, just wanted to express my gratitude for the support on this ticket and its friends T344905: Publish WDQS JNL files to dumps.wikimedia.org <https://phabricator.wikimedia.org/T344905> and T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'` <https://phabricator.wikimedia.org/T347647>. FWIW, I do think it would be good to automate this. As a matter of getting to a functional WDQS local environment replete with BlazeGraph data, it would accelerate things a lot. I think my only reservations are:

1. It takes time to automate. Any rough guess on the level of effort for that? I understand that'd inform relative prioritization against the large pile of other things.

2. The energy savings are possibly unclear, at least in the current case (partly because it's hard to know how much energy is being expended, which could be guessed at from the number of dump downloads; I'm not sure how easy it is to get those stats; this is different from the bandwidth transfer on Cloudflare R2). However, I would probably err on the side of assuming that the automation will ultimately boost the technical communities' interest in and ability to trial things locally (right now the barriers are somewhat prohibitive) and that the energy savings will roughly net out - ironically, if it attracts more people, they'll consume more energy in the aggregate, but they'll also be vastly more energy-efficient because they won't have to ETL, which takes a lot of compute resources.

For potential reusers (e.g., Enterprise or other institutions) it might help smooth things along a bit, although this is mostly my conjecture. Thinking ahead a little, we'd probably want to generalize anything so that it can take arbitrary `.jnl`s, for example for split graphs.
TASK DETAIL https://phabricator.wikimedia.org/T347605
[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation
dr0ptp4kt added a comment. Addressing @Addshore's comment in T344905#9210122 <https://phabricator.wikimedia.org/T344905#9210122>...

> I think the amount of time taken to decompress the JNL file should also be taken into consideration on varying hardware if compression is being considered.

Here's what I saw for performance:

/mnt/x $ time unzstd --output-dir-flat /mnt/y/ wikidata.jnl.zst
wikidata.jnl.zst: 1265888788480 bytes
real 219m10.733s
user 29m51.350s
sys 12m53.425s

This was on an i7-8700 CPU @ 3.20GHz. When I checked with `top`, it seemed to be using about 0.8-1.6 processors, hovering around 1, at any given time. From what I can see, `unzstd` doesn't support multi-threaded decompression. TASK DETAIL https://phabricator.wikimedia.org/T347605
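For scale, the timing above implies roughly the following decompression throughput. This is a quick back-of-envelope in Python with the figures copied from the output above.

```python
# Figures from the unzstd run above: decompressed size and wall time.
BYTES = 1_265_888_788_480
REAL_SECONDS = 219 * 60 + 10.733

mb_per_s = BYTES / REAL_SECONDS / 1e6   # decimal megabytes per second
hours = REAL_SECONDS / 3600             # wall-clock duration in hours
```

That works out to roughly 96 MB/s of decompressed output over about 3.7 hours, which on a single-threaded decompressor is plausibly bounded by disk as much as by CPU.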
[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation
dr0ptp4kt added a comment. Drawing from your inspiration, I downloaded with `wget` overnight and the `sha1sum` now matches the one from `wdqs1016`. Deflating now; will update with results. TASK DETAIL https://phabricator.wikimedia.org/T347605
[Wikidata-bugs] [Maniphest] T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`
dr0ptp4kt closed this task as "Resolved". dr0ptp4kt added a comment. I'm going to close this for now, given that the later dump munged okay and there seems to be an underlying issue somewhere, probably related to file transfer. The ``-- --skolemize`` flag will be a thing to consider for any future run, nonetheless. TASK DETAIL https://phabricator.wikimedia.org/T347647
[Wikidata-bugs] [Maniphest] T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`
dr0ptp4kt added a comment. I did manage to run a `sha1sum` on the older dump where the import had failed.

/mnt/w$ time sha1sum latest-all.ttl.gz
dedad5a589b3a3661a1f9ebb7f1a6bcbce1b4ef2 latest-all.ttl.gz
real 28m47.000s
user 3m21.104s
sys 0m46.825s

$ ls -al latest-all.ttl.gz
-rwxrwxrwx 1 adam adam 129294028486 Sep 27 05:35 latest-all.ttl.gz

It seems like there was data corruption somewhere in the transfer, the persistence to disk, or post-download; I don't see this `sha1sum` published anywhere. It's conceivable something went wrong during the `sha1sum`s themselves, but I'm not going to spend more time on this. Just wanted to document it for our future selves. One remark: normally, one would expect the download to fail if the problem were in the transfer itself. TASK DETAIL https://phabricator.wikimedia.org/T347647
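One way to sidestep a separate 28-minute `sha1sum` pass like the one above is to hash the stream while downloading. A minimal sketch using only the standard library; the chunk source shown is an in-memory illustration, but any iterable of bytes (e.g. an HTTP response read in chunks) works the same way.

```python
import hashlib

def sha1_of_stream(chunks):
    """Feed byte chunks through SHA-1 incrementally, so the digest is
    ready the moment the download finishes and no full-file re-read is
    needed. `chunks` is any iterable of bytes objects."""
    h = hashlib.sha1()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

# Illustration with in-memory chunks standing in for network reads.
digest = sha1_of_stream([b"ab", b"c"])
```

The resulting hex digest can then be compared directly against the checksum published alongside the dump (or fetched from a host like wdqs1016) before spending hours on decompression or import.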
[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation
dr0ptp4kt added a comment. Here's the `sha1sum` for the latest file I downloaded:

/mnt/x$ time sha1sum wikidata.jnl.zst
62327feb2c6ad5b352b5abfe9f0a4d3ccbeebfab wikidata.jnl.zst
real 77m16.215s
user 8m39.726s
sys 2m42.932s

TASK DETAIL https://phabricator.wikimedia.org/T347605
[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules
dr0ptp4kt claimed this task. TASK DETAIL https://phabricator.wikimedia.org/T347989
[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation
dr0ptp4kt added a comment. For me the first 300 GB of the file downloaded really, really fast, but `axel` was dropping connections, much as when I downloaded the large 1 TB file, so this download took about 5 hours. I'm fairly sure it could be done in 1-3 hours if everything were working well. I then encountered an error, reproducible across two separate downloads. @bking, does a test on the file yield the same corrupted-block warning for you if you download the .zst? What about with your already existing copy?

```
/mnt/x$ time unzstd --output-dir-flat /mnt/y/ wikidata.jnl.zst
wikidata.jnl.zst    : 649266 MB...
wikidata.jnl.zst : Decoding error (36) : Corrupted block detected

real    124m59.115s
user    17m44.647s
sys     7m24.509s

/mnt/x$ ls -l wikidata.jnl.zst
-rwxrwxrwx 1 adam adam 342189138219 Oct  3 02:32 wikidata.jnl.zst
```

I've kicked off a `sha1sum`, but it will take a while to run:

```
/mnt/x$ time sha1sum wikidata.jnl.zst
```

TASK DETAIL https://phabricator.wikimedia.org/T347605
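For what it's worth, `zstd` can check an archive's integrity without writing the decompressed output to disk (`zstd -t`), which avoids the multi-hundred-GB write when the goal is only to confirm whether a block is corrupted. A small self-contained sketch using a placeholder file rather than the real .jnl.zst:

```shell
# Build a tiny .zst archive to stand in for the real dump.
printf 'hello zstd\n' > sample.txt
zstd -q -f sample.txt -o sample.txt.zst

# -t decompresses in memory and verifies the frames, exiting non-zero
# on a corrupted block, without writing any decompressed output.
zstd -t sample.txt.zst && echo "archive OK"
```

Running `-t` on both the freshly downloaded copy and an already-known-good copy would distinguish a bad transfer from a bad source file.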
[Wikidata-bugs] [Maniphest] T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`
dr0ptp4kt added a comment. The addshore .jnl (August file) does launch nicely with `./runBlazegraph.sh`. TASK DETAIL https://phabricator.wikimedia.org/T347647
[Wikidata-bugs] [Maniphest] T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`
dr0ptp4kt added a comment. The addshore .jnl (August file) download completed, using the Linux tool `axel`. Working from memory as I checked on the download over my 1 Gbps connection: the first 800 GB or so downloaded in the first 3-4 hours, then (as some Cloudflare connections seemed to fall off) the remaining 400 GB or so took another 18 hours, so the total download time was about 22 hours. Next will be to verify that it loads cleanly. TASK DETAIL https://phabricator.wikimedia.org/T347647
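Since `axel` keeps a `.st` state file while a download is in progress, re-invoking the same command after a dropped connection resumes from where it stopped rather than restarting. A hedged sketch of a retry wrapper for a flaky multi-hour download (the URL and connection count are placeholders; `-n` sets the number of parallel connections):

```shell
#!/bin/sh
# Placeholder URL; substitute the real dump location.
url="https://example.org/dumps/wikidata.jnl.zst"

# Keep re-running axel until it exits successfully; its .st state
# file lets each retry continue from where the last attempt stopped.
until axel -n 8 "$url"; do
    echo "download interrupted, retrying in 10s..." >&2
    sleep 10
done
```

This doesn't fix the dropped connections themselves, but it removes the need to babysit a 20-hour download.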
[Wikidata-bugs] [Maniphest] T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`
dr0ptp4kt added a comment. Update - the newer dump munged without any problems. TASK DETAIL https://phabricator.wikimedia.org/T347647