[Wikidata-bugs] [Maniphest] T215413: Image Classification Research and Development

2024-05-16 Thread dr0ptp4kt
dr0ptp4kt removed a project: Reading-Admin. TASK DETAIL https://phabricator.wikimedia.org/T215413 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Miriam, dr0ptp4kt Cc: dr0ptp4kt, fkaelin, AikoChou, Capankajsmilyo, Mholloway, Ottomata, Jheald, Cirdan

[Wikidata-bugs] [Maniphest] T123349: EPIC: Article placeholders using wikidata

2024-05-16 Thread dr0ptp4kt
dr0ptp4kt removed a project: Reading-Admin. TASK DETAIL https://phabricator.wikimedia.org/T123349 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: waldyrious, Lydia_Pintscher, Nasirkhan, Aklapper, StudiesWorld, Lucie, atgo, dr0ptp4kt

[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-05-13 Thread dr0ptp4kt
dr0ptp4kt closed this task as "Resolved". dr0ptp4kt added a comment. I just added a link to https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update#See_also . Marking this ticket as resolved after noticing it was still open. TASK DETA

[Wikidata-bugs] [Maniphest] T352538: [EPIC] Evaluate the impact of the graph split

2024-05-13 Thread dr0ptp4kt
dr0ptp4kt closed subtask T355037: Compare the performance of sparql queries between the full graph and the subgraphs as Resolved. TASK DETAIL https://phabricator.wikimedia.org/T352538 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc

[Wikidata-bugs] [Maniphest] T363721: Show "small logo or icon" as fallback image in search

2024-05-13 Thread dr0ptp4kt
dr0ptp4kt edited projects, added Wikidata; removed Discovery-Search (Current work). TASK DETAIL https://phabricator.wikimedia.org/T363721 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: Aklapper, ChristianKl, Danny_Benjafield_WMDE

[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)

2024-05-09 Thread dr0ptp4kt
dr0ptp4kt closed this task as "Resolved". dr0ptp4kt claimed this task. dr0ptp4kt added a comment. Thanks @RKemper ! These speed gains are welcome news. We should discuss in an upcoming meeting whether there are any further actions. I can see how we may want to set the bufferCapacity

[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)

2024-05-09 Thread dr0ptp4kt
dr0ptp4kt added a comment. Mirroring comment in T359062#9783010 <https://phabricator.wikimedia.org/T359062#9783010>: > And for the second run in T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors) <https://phabricator.wikimedia.org/T36

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-05-09 Thread dr0ptp4kt
dr0ptp4kt added a comment. On the gaming-class 2018 desktop, although the `bufferCapacity` value of 100 sped things up as described on this ticket, application of the CPU governor change did not seem to have any additional bearing (it took 2.47 days as compared to its previous
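
The archived comment cuts off before the governor details; on a typical Linux desktop, a change like the one described is usually applied along these lines (a sketch, not necessarily the exact command that was run):

    # switch all cores to the "performance" governor (assumes the cpupower utility is installed)
    sudo cpupower frequency-set -g performance
    # confirm the active governor
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor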

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-05-09 Thread dr0ptp4kt
dr0ptp4kt added a comment. And for the second run in T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors) <https://phabricator.wikimedia.org/T362920> we saw that this took about 3089 minutes, or about 2.15 days, for the scholarly article entity

[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)

2024-05-07 Thread dr0ptp4kt
dr0ptp4kt added a comment. In T362920#9776418 <https://phabricator.wikimedia.org/T362920#9776418>, @RKemper wrote: > @dr0ptp4kt > >> we saw that this took about 3702 minutes, or about 2.57 //hours// > > Typo you'll want to fix here and in the original: 2.

[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)

2024-05-06 Thread dr0ptp4kt
dr0ptp4kt added a comment. Mirroring comment in T359062#9775908 <https://phabricator.wikimedia.org/T359062#9775908>: > In T362920 <https://phabricator.wikimedia.org/T362920>: Benchmark Blazegraph import with increased buffer capacity (and other factors) we saw that this

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-05-06 Thread dr0ptp4kt
dr0ptp4kt added a comment. In T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors) <https://phabricator.wikimedia.org/T362920> we saw that this took about 3702 minutes, or about 2.57 days, for the scholarly article entity with the CPU governor

[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)

2024-05-02 Thread dr0ptp4kt
dr0ptp4kt added a comment. Another thing that can be nice for figuring out stuff later is to add some timing and a simple log file. A command like the following was helpful when I was trying this out on the gaming-class desktop (you may not need this if your tmux session lets you scroll
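
The snippet cuts off before the command itself; judging from the invocation quoted in the 2024-04-07 update below, the wrapper looked roughly like this sketch (paths assumed):

    # record a start timestamp, then run the import while appending all output to a log file
    date | tee loadData.log
    time ./loadData.sh -n wdq -d /path/to/munged-files 2>&1 | tee -a loadData.log
    # from another terminal, follow progress without relying on tmux scrollback
    tail -f loadData.log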

[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)

2024-05-02 Thread dr0ptp4kt
dr0ptp4kt added a comment. @RKemper I think that's captured in P54284 <https://phabricator.wikimedia.org/P54284> . If you need to get a copy of the files, there's a pointer in T350106#9381611 <https://phabricator.wikimedia.org/T350106#9381611> for how one might go about copyi

[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)

2024-04-18 Thread dr0ptp4kt
dr0ptp4kt added a project: Wikidata. TASK DETAIL https://phabricator.wikimedia.org/T362920 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: Aklapper, dr0ptp4kt, Danny_Benjafield_WMDE, S8321414, Astuthiodit_1, AWesterinen, karapayneWMDE

[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity (and other factors)

2024-04-18 Thread dr0ptp4kt
dr0ptp4kt renamed this task from "Benchmark Blazegraph import with increased buffer capacity" to "Benchmark Blazegraph import with increased buffer capacity (and other factors)". TASK DETAIL https://phabricator.wikimedia.org/T362920 EMAIL PREFERENCES https://phabr

[Wikidata-bugs] [Maniphest] T362920: Benchmark Blazegraph import with increased buffer capacity

2024-04-18 Thread dr0ptp4kt
dr0ptp4kt created this task. dr0ptp4kt added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION In T359062: Assess Wikidata dump import hardware <https://phabricator.wikimedia.org/T359062> there's compelling evidence that increasing

[Wikidata-bugs] [Maniphest] T362060: Generalize ScholarlyArticleSplitter

2024-04-16 Thread dr0ptp4kt
dr0ptp4kt added a comment. **Running time** Total Uptime: 55 min. This was faster than in T347989#9335980 <https://phabricator.wikimedia.org/T347989#9335980>. Nice! **Counts** To be discussed in code review. **Samples** These look similar to what we'd

[Wikidata-bugs] [Maniphest] T362060: Generalize ScholarlyArticleSplitter

2024-04-16 Thread dr0ptp4kt
dr0ptp4kt added a comment. I kicked off a run using the current version of the patch with the following command and backing table, and its status can be followed here: https://yarn.wikimedia.org/cluster/app/application_1713178047802_16409 So long as I haven't made an error
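
The submission command itself is truncated out of the archive; by the shape of the similar run quoted in the 2023-12-04 T350106 entry below, it presumably looked something like this sketch, in which the class name, JAR name, and date argument are hypothetical placeholders:

    # class name, JAR name, and arguments below are hypothetical placeholders
    spark3-submit --master yarn \
      --driver-memory 16G --executor-memory 12G --executor-cores 4 \
      --conf spark.driver.cores=2 --conf spark.executor.memoryOverhead=4g \
      --class org.wikidata.query.rdf.spark.EntitySplitter \
      rdf-spark-tools.jar --date 2024-04-15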

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-04-10 Thread dr0ptp4kt
dr0ptp4kt added a comment. Good news. With the N-triples style scholarly entity graph files, with a buffer capacity of 100, a write retention queue capacity of 4000, and a heap size of 31g, on the gaming-class desktop, it took about 2.40 days. Recall that with buffer capacity

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-04-08 Thread dr0ptp4kt
dr0ptp4kt added a comment. Update: With the buffer capacity at 100, file number 550 of the scholarly graph was imported as of `Mon Apr 8 03:22:08 PM CDT 2024`. So, under 28 hours so far (buffer capacity at 10 was more than 36 hours). Processing part-00550-46f26ac6-0b21

[Wikidata-bugs] [Maniphest] T361246: scap deploy should not repool a wdqs node that is depooled

2024-04-08 Thread dr0ptp4kt
dr0ptp4kt added a project: Discovery-Search (Current work). TASK DETAIL https://phabricator.wikimedia.org/T361246 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dcausse, Aklapper, Danny_Benjafield_WMDE, S8321414, Astuthiodit_1

[Wikidata-bugs] [Maniphest] T361935: Adapt the WDQS Streaming Updater to update multiple WDQS subgraphs

2024-04-08 Thread dr0ptp4kt
dr0ptp4kt added a project: Discovery-Search (Current work). TASK DETAIL https://phabricator.wikimedia.org/T361935 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: Daniel_Mietchen, dr0ptp4kt, pfischer, dcausse, Aklapper

[Wikidata-bugs] [Maniphest] T361950: Ensure that WDQS query throttling does not interfere with federation

2024-04-08 Thread dr0ptp4kt
dr0ptp4kt added a project: Discovery-Search (Current work). TASK DETAIL https://phabricator.wikimedia.org/T361950 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: Daniel_Mietchen, Aklapper, dcausse, Danny_Benjafield_WMDE, S8321414

[Wikidata-bugs] [Maniphest] T362060: Generalize ScholarlyArticleSplitter

2024-04-08 Thread dr0ptp4kt
dr0ptp4kt added a project: Discovery-Search (Current work). TASK DETAIL https://phabricator.wikimedia.org/T362060 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dcausse, Aklapper, Danny_Benjafield_WMDE, S8321414, Astuthiodit_1

[Wikidata-bugs] [Maniphest] T361114: Alert Search Platform and/or DPE SRE when Wikidata is lagged

2024-04-08 Thread dr0ptp4kt
dr0ptp4kt set the point value for this task to "2". TASK DETAIL https://phabricator.wikimedia.org/T361114 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: Lucas_Werkmeister_WMDE, dcausse, Aklapper, bking, Danny_Benjafield_WMDE,

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-04-07 Thread dr0ptp4kt
dr0ptp4kt added a comment. With bufferCapacity at 100, kicked it off again with the scholarly article entity graph files:

    ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ date | tee loadData.log; time ./loadData.sh -n wdq -d /mnt/firehose/split_0/nt_wd_schol -s 0 -e 0 2

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-04-07 Thread dr0ptp4kt
dr0ptp4kt added a comment. Update. On the gaming-class machine it took about 3.25 days to import the scholarly article entity graph, using a buffer capacity of 10 (compare this with 5.875 days on wdqs1024 <https://phabricator.wikimedia.org/T350465#9405888>). This re

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-04-05 Thread dr0ptp4kt
dr0ptp4kt added a comment. Just updating on how far along this run is: file 550 of the scholarly article entity side of the graph is being processed. There are files 0 through 1023 for this side of the graph. Note that I did think to `tee` output this time around so that generally/hopefully

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-04-04 Thread dr0ptp4kt
dr0ptp4kt added a comment. Following roughly the procedure in P54284 <https://phabricator.wikimedia.org/P54284> to rename the Spark-produced graph files (and updating `loadData.sh` with `FORMAT=part-%05d-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz` and still having a `date` call
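
P54284 holds the actual procedure; the gist of such a rename, sketched here with the suffix taken from the FORMAT above and the numbering assumed, is to renumber the part files into the zero-padded sequence that `loadData.sh` iterates over:

    # renumber part files 0..N-1 so that %05d in FORMAT matches them in order
    i=0
    mkdir -p renamed
    for f in part-*.ttl.gz; do    # glob expands in sorted order
      mv "$f" "$(printf 'renamed/part-%05d-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz' "$i")"
      i=$((i+1))
    done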

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-04-03 Thread dr0ptp4kt
dr0ptp4kt added a comment. On the morning of April 3, around 6:25 AM, I SSH'd in to check progress, and it was working, but going slowly, similar to the day before. It was on a file number in the 1200s, but I didn't write down the number or copy terminal output; I do remember seeing

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-04-02 Thread dr0ptp4kt
dr0ptp4kt added a comment. Now this is interesting: we're now past 4 days (about 4 days and 1 hour) of this running, and with buffer capacity at 10 instead of 100 (but this time without any gap between the batches of files), there's still a good way to go yet. Processing

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-04-01 Thread dr0ptp4kt
dr0ptp4kt added a comment. The run with buffer at 100, heap size at 31g, and queue capacity at 4000 on the gaming-class desktop completed. Processing wikidump-01332.ttl.gz totalElapsed=13580ms, elap

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-21 Thread dr0ptp4kt
dr0ptp4kt added a comment. **AWS EC2 servers** After exploring a battery of EC2 servers, four instance types were selected and the commands posted were run. The configuration most like our `wdqs1021-1023` servers (third generation Intel Xeon) is listed first. The fastest option

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-21 Thread dr0ptp4kt
dr0ptp4kt added a comment. By the way, I'm attempting a run for the first 1332 munged files (one shy of file 1333, where it terminated last time around) with buffer at 100, heap size at 31g, and queue capacity at 4000 on the gaming-class desktop to see whether this imports smoothly

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-20 Thread dr0ptp4kt
dr0ptp4kt added a comment. The run to check with heap size of 31g, queue capacity of 8000, and buffer at 100 stalled at file 107. TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-20 Thread dr0ptp4kt
dr0ptp4kt added a comment. Attempting a run with a **queue capacity of 8000**, buffer of 100, and heap size of 16g on the gaming-class desktop to mimic the MacBook Pro, things were slower than with a queue capacity of 4000, buffer of 100, and heap size of 31g on the gaming-class

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-19 Thread dr0ptp4kt
dr0ptp4kt added a comment. **About Amazon Neptune** Amazon Neptune was set to import using the simpler N-Triples file format with its serverless configuration at 128 NCUs (about 256 GB of RAM with some attendant CPU). We don't use N-Triples files in our existing import process

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-19 Thread dr0ptp4kt
dr0ptp4kt added a comment. **Going for the full import** Further import commenced from there with a `bufferCapacity` of 100:

    ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ date
    Mon Mar 4 06:31:06 PM CST 2024
    ubuntu22:~/rdf/dist/target/service-0.3.138

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-19 Thread dr0ptp4kt
dr0ptp4kt added a comment. **More about bufferCapacity** Similarly, a run with 150 munged files was attempted with the buffer in RWStore.properties increased from 10 to 100, with the target as the NVMe.

    com.bigdata.rdf.sail.bufferCapacity=100
    ubuntu22:~/rdf/dist
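
A one-liner along these lines (a sketch; the properties file lives alongside the loader in the service directory) makes that change before restarting the loader:

    # raise the SAIL buffer capacity from 10 to 100, then verify
    sed -i 's/^com\.bigdata\.rdf\.sail\.bufferCapacity=.*/com.bigdata.rdf.sail.bufferCapacity=100/' RWStore.properties
    grep bufferCapacity RWStore.properties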

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-19 Thread dr0ptp4kt
dr0ptp4kt added a comment. **More about NVMe versus SSD** Runs were also done to see the effects on 150 munged files (out of a set of 2202 files) from the full Wikidata import, which allows for exercising more disk-related pieces. This was tried with both types of target disk: SATA SSD

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-08 Thread dr0ptp4kt
dr0ptp4kt added a subscriber: ssingh. dr0ptp4kt added a comment. @ssingh would you mind if the following command were run on one of the newer cp hosts with a newer, higher write-throughput NVMe? If that's okay, got a recommended node? I don't have access, but I think @bking may. `sudo sync; sudo
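
The archived message truncates the command after `sudo sync; sudo`; a common pattern for a rough sequential-write check that matches that prefix is something like this sketch (target path and size assumed):

    # flush and drop caches, then time a direct-I/O 16 GiB sequential write to the target disk
    sudo sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    sudo dd if=/dev/zero of=/srv/ddtest bs=1M count=16384 oflag=direct status=progress
    sudo rm /srv/ddtest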

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-08 Thread dr0ptp4kt
dr0ptp4kt added a comment. Thanks @bking ! It looks like the NVMe in this one is not a higher speed one for writes, and I'm also wondering if perhaps its write performance has degraded with age. I'll paste in the results here, but this was slower than the other servers, ironically (although

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-07 Thread dr0ptp4kt
dr0ptp4kt updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-07 Thread dr0ptp4kt
dr0ptp4kt updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-07 Thread dr0ptp4kt
dr0ptp4kt added a comment. First, adding some commands that were used for Blazegraph imports on Ubuntu 22.04. I had originally tried a good number of EC2 instance types, and then went back to focus on just four of them with a sequence of repeatable commands (this wasn't scripted

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-06 Thread dr0ptp4kt
dr0ptp4kt updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot

[Wikidata-bugs] [Maniphest] T358727: Reclaim recently-decommed CP host for WDQS (see T352253)

2024-03-05 Thread dr0ptp4kt
dr0ptp4kt added a comment. @VRiley-WMF any pointers on how to reach this node over iDRAC/iLO and set it up with a hostname of `wdqs1025.eqiad.wmnet`? I'm wondering if maybe there's a direct IP or IPs, given that there don't seem to be DNS records for `cp1086.eqiad.wmnet` or `cp1086

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-04 Thread dr0ptp4kt
dr0ptp4kt updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-04 Thread dr0ptp4kt
dr0ptp4kt moved this task from Incoming to Current work on the Wikidata-Query-Service board. dr0ptp4kt removed a project: Wikidata-Query-Service. TASK DETAIL https://phabricator.wikimedia.org/T359062 WORKBOARD https://phabricator.wikimedia.org/project/board/891/ EMAIL PREFERENCES https

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-04 Thread dr0ptp4kt
dr0ptp4kt updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, karapayneWMDE

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-04 Thread dr0ptp4kt
dr0ptp4kt updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T359062 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, karapayneWMDE

[Wikidata-bugs] [Maniphest] T359062: Assess Wikidata dump import hardware

2024-03-04 Thread dr0ptp4kt
dr0ptp4kt changed the task status from "Open" to "In Progress". dr0ptp4kt triaged this task as "Medium" priority. dr0ptp4kt claimed this task. dr0ptp4kt added projects: Wikidata-Query-Service, Discovery-Search (Current work). dr0ptp4kt updated the task des

[Wikidata-bugs] [Maniphest] T358727: Reclaim recently-decommed CP host for WDQS (see T352253)

2024-03-01 Thread dr0ptp4kt
dr0ptp4kt added a comment. Thanks @VRiley-WMF ! @bking is up next for imaging, I think. TASK DETAIL https://phabricator.wikimedia.org/T358727 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: VRiley-WMF, dr0ptp4kt Cc: Jclark-ctr, VRiley-WMF, ssingh

[Wikidata-bugs] [Maniphest] T358727: Reclaim recently-decommed CP host for WDQS (see T352253)

2024-02-29 Thread dr0ptp4kt
dr0ptp4kt added a parent task: T358533: Hardware requests for Search Platform FY2024-2025. TASK DETAIL https://phabricator.wikimedia.org/T358727 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: Jclark-ctr, VRiley-WMF, ssingh, RKemper

[Wikidata-bugs] [Maniphest] T358727: Reclaim recently-decommed CP host for WDQS (see T352253)

2024-02-29 Thread dr0ptp4kt
dr0ptp4kt added a parent task: T336443: Investigate performance differences between wdqs2022 and older hosts. TASK DETAIL https://phabricator.wikimedia.org/T358727 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: Jclark-ctr, VRiley-WMF

[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-02-05 Thread dr0ptp4kt
dr0ptp4kt added a comment. I summarized at https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Graph_split_IGUANA_performance . When we have a mailing list post during the next week or so, we'll want to move this to be a subpage of the target page of the post. TASK DETAIL https

[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-02-02 Thread dr0ptp4kt
dr0ptp4kt added a comment. In T355037#9508760 <https://phabricator.wikimedia.org/T355037#9508760>, @dcausse wrote: > @dr0ptp4kt thanks! is the difference in the number of successful queries only explained by the improvement in query time or are there some improvements in t

[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-02-01 Thread dr0ptp4kt
dr0ptp4kt added a comment. Here's the output from the latest run, based upon a larger set of queries from a random sample of WDQS queries.

    $ /usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -cp iguana-3.3.3.jar org.aksw.iguana.rp.analysis.TabularTransform -e result.nt

[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-31 Thread dr0ptp4kt
dr0ptp4kt added a comment. A run is in progress for 78K+ queries from a set of 100,000 random queries. It should be done in under 10 hours from now.

    scala> val full_random = spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/full_random_classified

[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-30 Thread dr0ptp4kt
dr0ptp4kt added a comment. Below are "per-query" summary stats. I put this together by bringing the CSV data into Google Sheets for now; all of the columns are calculated over the "per-query" rows (but you'll see how the Mean corresponds with the

[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-27 Thread dr0ptp4kt
dr0ptp4kt added a comment. Here are the data produced by IGUANA once piped through the CSV utility introduced in https://gitlab.wikimedia.org/repos/search-platform/IGUANA/-/merge_requests/3/diffs with a command of the following form (for the attentive reader, note that I had to rename

[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-27 Thread dr0ptp4kt
dr0ptp4kt added a comment. Now a screenshot from the re-run of the randomized order queries, followed by a screenshot showing the two runs on the randomized order queries side by side. F41722569: Screenshot 2024-01-27 at 6.36.58 AM.png <https://phabricator.wikimedia.org/F41722

[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-26 Thread dr0ptp4kt
dr0ptp4kt added a comment. Now, the screenshot from the randomized-order queries. I'll run one more time to see that comparable output is achieved. Those were produced with the following; this latest output file has been moved to `result.nt.003`.

    scala> val joined6 = wikidata.as

[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-26 Thread dr0ptp4kt
dr0ptp4kt added a comment. Now, a screenshot showing the re-run. And then a screenshot showing them side-by-side. This is just for the visual, and the data produced from IGUANA (what is in the `.nt` output that we can convert to a handy CSV) should be more telling. Next up, I'll

[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-25 Thread dr0ptp4kt
dr0ptp4kt added a comment. Dropping in a screenshot from Grafana from this first pass; I also made a copy of `result.nt` as `result.nt.001`. Re-running to see whether server behavior is similar. F41718197: Screenshot 2024-01-25 at 7.43.14 PM.png <https://phabricator.wikimedia.org/F41718

[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-25 Thread dr0ptp4kt
dr0ptp4kt added a comment. For the first pass, the following configuration is being used for an hour-long test conducted from `stat1006` with config file `wdqs-split-test.yml`:

    datasets:
      - name: "split"
    connections:
      - name: "baseline"

[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-25 Thread dr0ptp4kt
dr0ptp4kt claimed this task. TASK DETAIL https://phabricator.wikimedia.org/T355037 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot

[Wikidata-bugs] [Maniphest] T355037: Compare the performance of sparql queries between the full graph and the subgraphs

2024-01-25 Thread dr0ptp4kt
dr0ptp4kt updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T355037 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot

[Wikidata-bugs] [Maniphest] T350106: Implement a spark job that converts a RDF triples table into a RDF file format

2024-01-04 Thread dr0ptp4kt
dr0ptp4kt added a comment. Imports seemed to work. **Non-scholarly article side (proxied to wdqs1024.eqiad.wmnet)** F41650681: split-non-schol-side.gif <https://phabricator.wikimedia.org/F41650681> **Scholarly article side (proxied to wdqs1023.eqiad.wmnet)** F41650680:

[Wikidata-bugs] [Maniphest] T350106: Implement a spark job that converts a RDF triples table into a RDF file format

2023-12-05 Thread dr0ptp4kt
dr0ptp4kt added a comment. After an update to the script (PS6) and a fresh run of the same commands new files have been `hdfs-rsync`'d to `stat1006:~dr0ptp4kt/gzips` in anticipation of doing a file transfer over to the WDQS graph split test servers. Here's a very small sample of what

[Wikidata-bugs] [Maniphest] T350106: Implement a spark job that converts a RDF triples table into a RDF file format

2023-12-04 Thread dr0ptp4kt
dr0ptp4kt added a subscriber: RKemper. dr0ptp4kt added a comment. I ran the current version of the code as follows:

    spark3-submit --master yarn --driver-memory 16G --executor-memory 12G --executor-cores 4 --conf spark.driver.cores=2 --conf spark.executor.memoryOverhead=4g --conf

[Wikidata-bugs] [Maniphest] T350106: Implement a spark job that converts a RDF triples table into a RDF file format

2023-12-04 Thread dr0ptp4kt
dr0ptp4kt added a comment. Not using right now, but here's roughly how one might go about generating more expanded Turtle statements without reverse-mapping prefixes: F41561068 <https://phabricator.wikimedia.org/F41561068> TASK DETAIL https://phabricator.wikimedia.org/T350106

[Wikidata-bugs] [Maniphest] T350106: Implement a spark job that converts a RDF triples table into a RDF file format

2023-11-29 Thread dr0ptp4kt
dr0ptp4kt added a subscriber: EBernhardson. dr0ptp4kt added a comment. Adding a note so I don't forget: advice from @BTullis is to avoid NFS if possible, and advice from @JAllemandou is to consider use of `hdfs-rsync` (after our call I sought this out and found these: https
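
For a one-shot transfer, the standard HDFS CLI covers the copy itself (a sketch with assumed paths); `hdfs-rsync` as mentioned above adds rsync-style incremental syncing on top of this:

    # pull the Spark-produced gzip files out of HDFS onto the stat host's local disk
    hdfs dfs -copyToLocal hdfs:///user/example/rdf_graph_split ~/gzips/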

[Wikidata-bugs] [Maniphest] T350106: Implement a spark job that converts a RDF triples table into a RDF file format

2023-11-29 Thread dr0ptp4kt
dr0ptp4kt claimed this task. TASK DETAIL https://phabricator.wikimedia.org/T350106 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: Aklapper, BTullis, bking, dr0ptp4kt, JAllemandou, dcausse, Danny_Benjafield_WMDE, Astuthiodit_1

[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules

2023-11-20 Thread dr0ptp4kt
dr0ptp4kt added a comment. The job completed. The counts match up on this productionized job compared with the prior one run in my namespace. Following are some Hive queries in case needed later. Below that is a really small sample of the resultant data in tabular format for each partition

[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules

2023-11-20 Thread dr0ptp4kt
dr0ptp4kt closed this task as "Resolved". dr0ptp4kt triaged this task as "High" priority. TASK DETAIL https://phabricator.wikimedia.org/T347989 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: EBernhardson, bking

[Wikidata-bugs] [Maniphest] T337013: [Epic] Splitting the graph in WDQS

2023-11-20 Thread dr0ptp4kt
dr0ptp4kt closed subtask T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules as Resolved. TASK DETAIL https://phabricator.wikimedia.org/T337013 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt

[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules

2023-11-16 Thread dr0ptp4kt
dr0ptp4kt added a subscriber: EBernhardson. dr0ptp4kt added a comment. Spark patch merged, new Jenkins build of the rdf JAR done, Airflow patch merged. This is deployed to Search's Airflow instance and the job is running. Thank you, @dcausse and @EBernhardson. Here's the location

[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules

2023-11-15 Thread dr0ptp4kt
dr0ptp4kt moved this task from In Progress to Needs review on the Discovery-Search (Current work) board. dr0ptp4kt added a comment. Here's what I saw after re-running. So, we should be good with the latest patchset that goes without distinct() on the final graphs. Without distinct

[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules

2023-10-26 Thread dr0ptp4kt
dr0ptp4kt added a comment. Update: it seems to be working. I'd say this is maybe 75% complete. It takes about 1h40m to run and generate the two different partitions. WIP/Draft patches posted at https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/969229 and ^ . They require

[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-25 Thread dr0ptp4kt
dr0ptp4kt added a comment. I also see https://grafana.wikimedia.org/d/00264/wikidata-dump-downloads?orgId=1&refresh=5m&from=now-2y&to=now which I noticed from some tickets involving @Addshore (cf. T280678: Crunch and delete many old dumps logs <https://phabricator.wikimedia.org/T280678> and f

[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-25 Thread dr0ptp4kt
dr0ptp4kt added a comment. ^ Update. TASK DETAIL https://phabricator.wikimedia.org/T347605 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: bking, dr0ptp4kt Cc: jochemla, Addshore, dr0ptp4kt, Aklapper, bking, Danny_Benjafield_WMDE, Astuthiodit_1

[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-25 Thread dr0ptp4kt
dr0ptp4kt added a comment. Looking at yesterday's downloads with a rudimentary grep, we're not far from 1K downloads, and that's just for the //latest-all// ones. That also doesn't consider mirrors.

    stat1007:~$ zgrep wikidatawiki /srv/log/webrequest/archive/dumps.wikimedia.org
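
The grep itself is cut off above; the rough shape of such a count, with the log filename assumed, is:

    # count one day's requests for the latest-all Wikidata dump files
    zgrep wikidatawiki /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-YYYYMMDD.gz | grep -c latest-all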

[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules

2023-10-20 Thread dr0ptp4kt
dr0ptp4kt added a comment. It took about 26min 24s to write `S_direct_triples` (7_293_925_470 rows) in basic Parquet. It's not all the rows (not even for its own partition, as that will include Value and Reference triples as well), but this means it ought to be possible for the job to write

[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules

2023-10-20 Thread dr0ptp4kt
dr0ptp4kt added a comment. TL;DR: this is about 45% done. This week I was working to address non-performant (often hanging or crashing) Spark runs. Last night I managed to get this running better, producing a reduction (the equivalent of `val_triples_only_used_by_sas` from https

[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-16 Thread dr0ptp4kt
dr0ptp4kt added a comment. Good question - I meant the contrast between the .ttl.gz dumps and everything that goes into munging and importing (in aggregate across all downloaders of those files) versus the same if this were done with the .jnl, where they don't have to munge

[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules

2023-10-13 Thread dr0ptp4kt
dr0ptp4kt added a comment. Personalized dev environment on the analytics cluster with an Airflow setup (stat1006) - I was able to execute the job, slightly hacked up to target a specific date and not keep running regularly (it eats lots of disk), producing `dr0ptp4kt.wikibase_rdf_with_split` using my Kerberos

[Wikidata-bugs] [Maniphest] T344905: Publish WDQS JNL files to dumps.wikimedia.org

2023-10-13 Thread dr0ptp4kt
dr0ptp4kt added a comment. > I think the ammount of time taken to decompress the JNL file should also be taken into consideration on varying hardware if compression is being considered. Closing the loop, posted my experience at T347605#9229608 <https://phabricator.wikimedia.org/T

[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-13 Thread dr0ptp4kt
dr0ptp4kt added a comment. @bking just wanted to express my gratitude for the support on this ticket and its friends T344905: Publish WDQS JNL files to dumps.wikimedia.org <https://phabricator.wikimedia.org/T344905> and T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error m

[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-05 Thread dr0ptp4kt
dr0ptp4kt added a comment. Addressing @Addshore's comment in T344905#9210122 <https://phabricator.wikimedia.org/T344905#9210122>... > I think the amount of time taken to decompress the JNL file should also be taken into consideration on varying hardware if compression

[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-05 Thread dr0ptp4kt
dr0ptp4kt added a comment. Drawing from your inspiration, I downloaded with `wget` overnight and the `sha1sum` now matches that from `wdqs1016`. Deflating now; will update with results. TASK DETAIL https://phabricator.wikimedia.org/T347605 EMAIL PREFERENCES https
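
The verify-then-deflate step is mechanical; a sketch, assuming zstd compression as the file extension elsewhere in this thread suggests:

    # confirm the download matches the checksum taken on wdqs1016, then decompress
    sha1sum wikidata.jnl.zst
    time zstd -d wikidata.jnl.zst -o wikidata.jnl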

[Wikidata-bugs] [Maniphest] T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`

2023-10-03 Thread dr0ptp4kt
dr0ptp4kt closed this task as "Resolved". dr0ptp4kt added a comment. I'm going to close this for now given that the later dump munged okay and there seems to be an underlying issue somewhere, probably related to file transfer. The ``-- --skolemize`` will be a thing to consider for

[Wikidata-bugs] [Maniphest] T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`

2023-10-03 Thread dr0ptp4kt
dr0ptp4kt added a comment. I did manage to run a `sha1sum` on the older dump where the import had failed.

    /mnt/w$ time sha1sum latest-all.ttl.gz
    dedad5a589b3a3661a1f9ebb7f1a6bcbce1b4ef2  latest-all.ttl.gz
    real    28m47.000s
    user    3m21.104s
    sys     0m46.825s

[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-03 Thread dr0ptp4kt
dr0ptp4kt added a comment. Here's the `sha1sum` for the latest file I had downloaded:

    /mnt/x$ time sha1sum wikidata.jnl.zst
    62327feb2c6ad5b352b5abfe9f0a4d3ccbeebfab  wikidata.jnl.zst
    real    77m16.215s
    user    8m39.726s
    sys     2m42.932s

TASK DETAIL https

[Wikidata-bugs] [Maniphest] T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules

2023-10-03 Thread dr0ptp4kt
dr0ptp4kt claimed this task. TASK DETAIL https://phabricator.wikimedia.org/T347989 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, karapayneWMDE

[Wikidata-bugs] [Maniphest] T347605: Document process for getting JNL files/consider automation

2023-10-03 Thread dr0ptp4kt
dr0ptp4kt added a comment. For me, the first 300 GB of the file went really, really fast. But `axel` kept dropping connections, similar to when I had downloaded the large 1 TB file, so this download took about 5 hours. I'm pretty sure it could be done in 1-3 hours, though, if everything were
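
For reference, a multi-connection axel invocation looks like this sketch (URL hypothetical); `-n` sets the number of parallel connections, and axel resumes dropped ones on its own:

    axel -n 8 -o wikidata.jnl.zst https://dumps.example.org/wikidata.jnl.zst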

[Wikidata-bugs] [Maniphest] T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`

2023-10-02 Thread dr0ptp4kt
dr0ptp4kt added a comment. The addshore .jnl (August file) does launch nicely with `./runBlazegraph.sh`. TASK DETAIL https://phabricator.wikimedia.org/T347647 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: Aklapper, Stang

[Wikidata-bugs] [Maniphest] T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`

2023-09-30 Thread dr0ptp4kt
dr0ptp4kt added a comment. The addshore .jnl (August file) download completed, using the Linux tool axel. Working from memory as I checked on the download over my 1 Gbps connection: the first 800 or so GB downloaded over the first 3-4 hours, then (as some Cloudflare connections seemed

[Wikidata-bugs] [Maniphest] T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`

2023-09-29 Thread dr0ptp4kt
dr0ptp4kt added a comment. Update - the newer dump munged without any problems. TASK DETAIL https://phabricator.wikimedia.org/T347647 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dr0ptp4kt Cc: Aklapper, Stang, dr0ptp4kt, bking
