dr0ptp4kt removed a project: Reading-Admin.
TASK DETAIL
https://phabricator.wikimedia.org/T215413
To: Miriam, dr0ptp4kt
Cc: dr0ptp4kt, fkaelin, AikoChou, Capankajsmilyo, Mholloway, Ottomata, Jheald,
Cirdan
dr0ptp4kt removed a project: Reading-Admin.
TASK DETAIL
https://phabricator.wikimedia.org/T123349
To: dr0ptp4kt
Cc: waldyrious, Lydia_Pintscher, Nasirkhan, Aklapper, StudiesWorld, Lucie,
atgo, dr0ptp4kt
dr0ptp4kt closed this task as "Resolved".
dr0ptp4kt added a comment.
I just added a link to
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update#See_also.
Marking this ticket as resolved after noticing it was still open.
dr0ptp4kt closed subtask T355037: Compare the performance of sparql queries
between the full graph and the subgraphs as Resolved.
TASK DETAIL
https://phabricator.wikimedia.org/T352538
To: dr0ptp4kt
Cc
dr0ptp4kt edited projects, added Wikidata; removed Discovery-Search (Current
work).
TASK DETAIL
https://phabricator.wikimedia.org/T363721
To: dr0ptp4kt
Cc: Aklapper, ChristianKl, Danny_Benjafield_WMDE
dr0ptp4kt closed this task as "Resolved".
dr0ptp4kt claimed this task.
dr0ptp4kt added a comment.
Thanks, @RKemper! These speed gains are welcome news. We should discuss in an
upcoming meeting whether there are any further actions. I can see how we may want
to set the bufferCapacity
dr0ptp4kt added a comment.
Mirroring comment in T359062#9783010
<https://phabricator.wikimedia.org/T359062#9783010>:
> And for the second run in T362920: Benchmark Blazegraph import with
increased buffer capacity (and other factors)
<https://phabricator.wikimedia.org/T36
dr0ptp4kt added a comment.
On the gaming-class 2018 desktop, although the `bufferCapacity` value of 100
sped things up as described on this ticket, applying the CPU governor change
did not seem to have any additional effect (it took 2.47 days as compared to its
previous
dr0ptp4kt added a comment.
And for the second run in T362920: Benchmark Blazegraph import with increased
buffer capacity (and other factors) <https://phabricator.wikimedia.org/T362920>
we saw that this took about 3089 minutes, or about 2.15 days, for the
scholarly article entity
dr0ptp4kt added a comment.
In T362920#9776418 <https://phabricator.wikimedia.org/T362920#9776418>,
@RKemper wrote:
> @dr0ptp4kt
>
>> we saw that this took about 3702 minutes, or about 2.57 //hours//
>
> Typo you'll want to fix here and in the original: 2.
dr0ptp4kt added a comment.
Mirroring comment in T359062#9775908
<https://phabricator.wikimedia.org/T359062#9775908>:
> In T362920 <https://phabricator.wikimedia.org/T362920>: Benchmark
Blazegraph import with increased buffer capacity (and other factors) we saw
that this
dr0ptp4kt added a comment.
In T362920: Benchmark Blazegraph import with increased buffer capacity (and
other factors) <https://phabricator.wikimedia.org/T362920> we saw that this
took about 3702 minutes, or about 2.57 hours, for the scholarly article entity
with the CPU governor
dr0ptp4kt added a comment.
Another thing that can be nice for figuring out stuff later is to add some
timing and a simple log file. A command like the following was helpful when I
was trying this out on the gaming-class desktop (you may not need this if your
tmux session lets you scroll
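(For illustration, a minimal sketch of that kind of timing-plus-logging wrapper, reusing the paths and flags shown elsewhere in this thread; not necessarily the exact invocation used:)
date | tee loadData.log
# `time` reports wall-clock duration for the whole pipeline; tee -a appends the
# per-file output to the same log so it survives the tmux scrollback.
time ./loadData.sh -n wdq -d /mnt/firehose/split_0/nt_wd_schol 2>&1 | tee -a loadData.log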
dr0ptp4kt added a comment.
@RKemper I think that's captured in P54284
<https://phabricator.wikimedia.org/P54284>. If you need to get a copy of the
files, there's a pointer in T350106#9381611
<https://phabricator.wikimedia.org/T350106#9381611> for how one might go about
copying them.
dr0ptp4kt added a project: Wikidata.
TASK DETAIL
https://phabricator.wikimedia.org/T362920
To: dr0ptp4kt
Cc: Aklapper, dr0ptp4kt, Danny_Benjafield_WMDE, S8321414, Astuthiodit_1,
AWesterinen, karapayneWMDE
dr0ptp4kt renamed this task from "Benchmark Blazegraph import with increased
buffer capacity" to "Benchmark Blazegraph import with increased buffer capacity
(and other factors)".
TASK DETAIL
https://phabricator.wikimedia.org/T362920
dr0ptp4kt created this task.
dr0ptp4kt added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION
In T359062: Assess Wikidata dump import hardware
<https://phabricator.wikimedia.org/T359062> there's compelling evidence that
increasing
dr0ptp4kt added a comment.
**Running time**
Total Uptime: 55 min
This was faster than in T347989#9335980
<https://phabricator.wikimedia.org/T347989#9335980>. Nice!
**Counts**
To be discussed in code review.
**Samples**
These look roughly like what we'd
dr0ptp4kt added a comment.
I kicked off a run using the current version of the patch with the following
command and backing table, and its status should be able to be followed here:
https://yarn.wikimedia.org/cluster/app/application_1713178047802_16409
So long as I haven't made an error
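(As an aside, a minimal sketch of following the same application from a stat host with the stock YARN CLI, in case the web UI is inconvenient; the application id is the one from the URL above:)
# Poll the job state:
yarn application -status application_1713178047802_16409
# After it finishes, fetch the aggregated logs:
yarn logs -applicationId application_1713178047802_16409 | less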
dr0ptp4kt added a comment.
Good news. With the N-Triples-style scholarly entity graph files, with a
buffer capacity of 100, a write retention queue capacity of 4000, and a
heap size of 31g, on the gaming-class desktop, it took about 2.40 days. Recall
that with buffer capacity
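(For reference, roughly where those three knobs live in a stock WDQS checkout; the property and variable names below are my best recollection and worth double-checking against RWStore.properties and runBlazegraph.sh:)
# RWStore.properties (Blazegraph journal settings):
com.bigdata.rdf.sail.bufferCapacity=100
com.bigdata.btree.writeRetentionQueue.capacity=4000
# The heap is picked up by the server start script via an environment variable:
HEAP_SIZE=31g ./runBlazegraph.sh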
dr0ptp4kt added a comment.
Update: With the buffer capacity at 100, file number 550 of the
scholarly graph was imported as of `Mon Apr 8 03:22:08 PM CDT 2024`. So,
under 28 hours so far (buffer capacity at 10 was more than 36 hours).
Processing part-00550-46f26ac6-0b21
dr0ptp4kt added a project: Discovery-Search (Current work).
TASK DETAIL
https://phabricator.wikimedia.org/T361246
To: dr0ptp4kt
Cc: dcausse, Aklapper, Danny_Benjafield_WMDE, S8321414, Astuthiodit_1
dr0ptp4kt added a project: Discovery-Search (Current work).
TASK DETAIL
https://phabricator.wikimedia.org/T361935
To: dr0ptp4kt
Cc: Daniel_Mietchen, dr0ptp4kt, pfischer, dcausse, Aklapper
dr0ptp4kt added a project: Discovery-Search (Current work).
TASK DETAIL
https://phabricator.wikimedia.org/T361950
To: dr0ptp4kt
Cc: Daniel_Mietchen, Aklapper, dcausse, Danny_Benjafield_WMDE, S8321414
dr0ptp4kt added a project: Discovery-Search (Current work).
TASK DETAIL
https://phabricator.wikimedia.org/T362060
To: dr0ptp4kt
Cc: dcausse, Aklapper, Danny_Benjafield_WMDE, S8321414, Astuthiodit_1
dr0ptp4kt set the point value for this task to "2".
TASK DETAIL
https://phabricator.wikimedia.org/T361114
To: dr0ptp4kt
Cc: Lucas_Werkmeister_WMDE, dcausse, Aklapper, bking, Danny_Benjafield_WMDE,
dr0ptp4kt added a comment.
With bufferCapacity at 100, I kicked it off again with the scholarly
article entity graph files:
ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ date | tee
loadData.log; time ./loadData.sh -n wdq -d /mnt/firehose/split_0/nt_wd_schol -s
0 -e 0 2
dr0ptp4kt added a comment.
Update. On the gaming-class machine it took about 3.25 days to import the
scholarly article entity graph, using a buffer capacity of 10 (compare this
with 5.875 days on wdqs1024
<https://phabricator.wikimedia.org/T350465#9405888>). This re
dr0ptp4kt added a comment.
Just updating on how far along this run is: file 550 of the scholarly article
entity side of the graph is being processed. There are files 0 through 1023 for
this side of the graph. Note that I did think to `tee` the output this time around
so that generally/hopefully
dr0ptp4kt added a comment.
Following roughly the procedure in P54284
<https://phabricator.wikimedia.org/P54284> to rename the Spark-produced graph
files (and updating `loadData.sh` with
`FORMAT=part-%05d-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz` and still
having a `date` call
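(A hypothetical sketch of the renaming step, since P54284 isn't reproduced here: it just forces the split files into the zero-padded sequence that the FORMAT pattern above iterates over.)
i=0
for f in $(ls part-*-c000.ttl.gz | sort); do
  # Build the name loadData.sh will look for at index $i.
  printf -v target 'part-%05d-46f26ac6-0b21-4832-be79-d7c8709f33fb-c000.ttl.gz' "$i"
  [ "$f" = "$target" ] || mv "$f" "$target"
  i=$((i+1))
done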
dr0ptp4kt added a comment.
This morning of April 3 around 6:25 AM I had SSH'd to check progress, and it
was working, but going slowly, similar to the day before. It was on a file
number in the 1200s, but I didn't write down the number or copy terminal
output; I do remember seeing
dr0ptp4kt added a comment.
Now this is interesting: we're now past 4 days (about 4 days and 1 hour) of
this running, and with buffer capacity at 10 instead of 100 (but
this time without any gap between the batches of files), there's still a good
way to go yet.
Processing
dr0ptp4kt added a comment.
The run with buffer at 100, heap size at 31g, and queue
capacity at 4000 on the gaming-class desktop completed.
Processing wikidump-01332.ttl.gz
blazegraph by SYSTAP totalElapsed=13580ms, elap
dr0ptp4kt added a comment.
**AWS EC2 servers**
After exploring a battery of EC2 instance types, four were selected and the
posted commands were run.
The configuration most like our `wdqs1021-1023` servers (third generation
Intel Xeon) is listed first. The fastest option
dr0ptp4kt added a comment.
By the way, I'm attempting a run for the first 1332 munged files (one shy of
file 1333, where the last run terminated) with buffer at 100, heap
size at 31g, and queue capacity at 4000 on the gaming-class desktop to see
whether this imports smoothly
dr0ptp4kt added a comment.
The run to check with heap size of 31g, queue capacity of 8000, and buffer at
100 stalled at file 107.
TASK DETAIL
https://phabricator.wikimedia.org/T359062
dr0ptp4kt added a comment.
In a run with a **queue capacity of 8000**, buffer of 100,
and heap size of 16g on the gaming-class desktop to mimic the MacBook Pro,
things were slower than with a queue capacity of 4000, buffer of 100, and heap
size of 31g on the gaming-class desktop.
dr0ptp4kt added a comment.
**About Amazon Neptune**
Amazon Neptune was set to import using the simpler N-Triples file format with
its serverless configuration at 128 NCUs (about 256 GB of RAM with some
attendant CPU). We don't use N-Triples files in our existing import process
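(For context, bulk loads into Neptune go through its loader HTTP API; a minimal sketch follows, with the endpoint, bucket, IAM role, and region all placeholders rather than what was actually used here:)
curl -s -X POST "https://<neptune-endpoint>:8182/loader" \
  -H 'Content-Type: application/json' \
  -d '{
        "source": "s3://example-bucket/wikidata-ntriples/",
        "format": "ntriples",
        "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
        "region": "us-east-1"
      }'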
dr0ptp4kt added a comment.
**Going for the full import**
Further import commenced from there with a `bufferCapacity` of 100:
ubuntu22:~/rdf/dist/target/service-0.3.138-SNAPSHOT$ date
Mon Mar 4 06:31:06 PM CST 2024
ubuntu22:~/rdf/dist/target/service-0.3.138
dr0ptp4kt added a comment.
**More about bufferCapacity**
Similarly, a run with 150 munged files was attempted with the buffer in
RWStore.properties increased from 10 to 100, with the NVMe as the
target.
com.bigdata.rdf.sail.bufferCapacity=100
ubuntu22:~/rdf/dist
dr0ptp4kt added a comment.
**More about NVMe versus SSD**
Runs were also done to see the effects on 150 munged files (out of a set of
2202 files) from the full Wikidata import, which allows for exercising more
disk-related pieces. This was tried with both types of target disk - SATA SSD
dr0ptp4kt added a subscriber: ssingh.
dr0ptp4kt added a comment.
@ssingh would you mind if the following command were run on one of the newer
cp hosts with a higher-write-throughput NVMe? If so, got a recommended
node? I don't have access, but I think @bking may.
`sudo sync; sudo
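(The command above is cut off; purely as an illustration of the kind of sequential-write check meant, with the target path hypothetical:)
sudo sync
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'   # drop the page cache so the write is measured cold
dd if=/dev/zero of=/srv/ddtest.img bs=1M count=16384 oflag=direct status=progress
rm /srv/ddtest.img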
dr0ptp4kt added a comment.
Thanks, @bking! It looks like the NVMe in this one is not a higher-speed one
for writes, and I'm also wondering if perhaps its write performance has
degraded with age. I'll paste in the results here, but this was slower than the
other servers, ironically (although
dr0ptp4kt updated the task description.
TASK DETAIL
https://phabricator.wikimedia.org/T359062
To: dr0ptp4kt
Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE,
Invadibot
dr0ptp4kt updated the task description.
TASK DETAIL
https://phabricator.wikimedia.org/T359062
To: dr0ptp4kt
Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE,
Invadibot
dr0ptp4kt added a comment.
First, adding some commands that were used for Blazegraph imports on Ubuntu
22.04. I had originally tried a good number of EC2 instance types, and then
went back to focus on just four of them with a sequence of repeatable commands
(this wasn't scripted
dr0ptp4kt updated the task description.
TASK DETAIL
https://phabricator.wikimedia.org/T359062
To: dr0ptp4kt
Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE,
Invadibot
dr0ptp4kt added a comment.
@VRiley-WMF any pointers on how to iDRAC / iLO into this node and set it up
with a hostname of `wdqs1025.eqiad.wmnet`? I'm wondering if maybe there's a
direct IP or IPs, given that there don't seem to be DNS records for
`cp1086.eqiad.wmnet` or `cp1086
dr0ptp4kt updated the task description.
TASK DETAIL
https://phabricator.wikimedia.org/T359062
To: dr0ptp4kt
Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE,
Invadibot
dr0ptp4kt moved this task from Incoming to Current work on the
Wikidata-Query-Service board.
dr0ptp4kt removed a project: Wikidata-Query-Service.
TASK DETAIL
https://phabricator.wikimedia.org/T359062
WORKBOARD
https://phabricator.wikimedia.org/project/board/891/
dr0ptp4kt updated the task description.
TASK DETAIL
https://phabricator.wikimedia.org/T359062
To: dr0ptp4kt
Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen,
karapayneWMDE
dr0ptp4kt updated the task description.
TASK DETAIL
https://phabricator.wikimedia.org/T359062
To: dr0ptp4kt
Cc: dr0ptp4kt, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen,
karapayneWMDE
dr0ptp4kt changed the task status from "Open" to "In Progress".
dr0ptp4kt triaged this task as "Medium" priority.
dr0ptp4kt claimed this task.
dr0ptp4kt added projects: Wikidata-Query-Service, Discovery-Search (Current
work).
dr0ptp4kt updated the task description.
dr0ptp4kt added a comment.
Thanks, @VRiley-WMF! @bking is up next for imaging, I think.
TASK DETAIL
https://phabricator.wikimedia.org/T358727
To: VRiley-WMF, dr0ptp4kt
Cc: Jclark-ctr, VRiley-WMF, ssingh
dr0ptp4kt added a parent task: T358533: Hardware requests for Search Platform
FY2024-2025.
TASK DETAIL
https://phabricator.wikimedia.org/T358727
To: dr0ptp4kt
Cc: Jclark-ctr, VRiley-WMF, ssingh, RKemper
dr0ptp4kt added a parent task: T336443: Investigate performance differences
between wdqs2022 and older hosts.
TASK DETAIL
https://phabricator.wikimedia.org/T358727
To: dr0ptp4kt
Cc: Jclark-ctr, VRiley-WMF
dr0ptp4kt added a comment.
I summarized the results at
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Graph_split_IGUANA_performance.
When we have a mailing list post during the next week or so, we'll want to
move this to be a subpage of the post's target page.
dr0ptp4kt added a comment.
In T355037#9508760 <https://phabricator.wikimedia.org/T355037#9508760>,
@dcausse wrote:
> @dr0ptp4kt thanks! is the difference in the number of successful queries
only explained by the improvement in query time or are there some improvements
in t
dr0ptp4kt added a comment.
Here's the output from the latest run based upon a larger set of queries from
a random sample of WDQS queries.
$ /usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -cp iguana-3.3.3.jar
org.aksw.iguana.rp.analysis.TabularTransform -e result.nt
dr0ptp4kt added a comment.
A run is in progress for 78K+ queries from a set of 100,000 random queries.
It should be done in under 10 hours from now.
scala> val full_random =
spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/full_random_classified
dr0ptp4kt added a comment.
Below are "per-query" summary stats. I put this together by bringing the CSV
data into Google Sheets for now - all of the columns are calculated over the
"per-query" rows (but you'll see how the Mean corresponds with the
dr0ptp4kt added a comment.
Here are the data produced by IGUANA once piped through the CSV utility
introduced in
https://gitlab.wikimedia.org/repos/search-platform/IGUANA/-/merge_requests/3/diffs
with a command of the following form (for the attentive reader, note that I
had to rename
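(The command itself is cut off above; it presumably took the same form as the TabularTransform invocation quoted earlier in this digest, roughly like the sketch below, with the output redirection an assumption on my part:)
/usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -cp iguana-3.3.3.jar \
  org.aksw.iguana.rp.analysis.TabularTransform -e result.nt > results.csv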
dr0ptp4kt added a comment.
Now a screenshot from the re-run of the randomized order queries, followed by
a screenshot showing the two runs on the randomized order queries side by side.
F41722569: Screenshot 2024-01-27 at 6.36.58 AM.png
<https://phabricator.wikimedia.org/F41722
dr0ptp4kt added a comment.
Now, the screenshot from the randomized-order queries. I'll run one more time
to confirm that comparable output is achieved. Those were produced with the
following. This latest output file has been moved to `result.nt.003`.
scala> val joined6 = wikidata.as
dr0ptp4kt added a comment.
Now, a screenshot showing the re-run. And then a screenshot showing them
side-by-side. This is just for the visual, and the data produced from IGUANA
(what is in the `.nt` output that we can convert to a handy CSV) should be more
telling.
Next up, I'll
dr0ptp4kt added a comment.
Dropping in a screenshot from Grafana from this first pass; I made a copy of
`result.nt` as `result.nt.001`. Re-running to verify that server behavior is
similar.
F41718197: Screenshot 2024-01-25 at 7.43.14 PM.png
<https://phabricator.wikimedia.org/F41718
dr0ptp4kt added a comment.
For the first pass, the following configuration is being used for an hour-long
test conducted from `stat1006`, with config file `wdqs-split-test.yml` as
follows.
datasets:
  - name: "split"
connections:
  - name: "baseline"
dr0ptp4kt claimed this task.
TASK DETAIL
https://phabricator.wikimedia.org/T355037
To: dr0ptp4kt
Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1,
karapayneWMDE, Invadibot
dr0ptp4kt updated the task description.
TASK DETAIL
https://phabricator.wikimedia.org/T355037
To: dr0ptp4kt
Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1,
karapayneWMDE, Invadibot
dr0ptp4kt added a comment.
Imports seemed to work.
**Non-scholarly article side (proxied to wdqs1024.eqiad.wmnet)**
F41650681: split-non-schol-side.gif
<https://phabricator.wikimedia.org/F41650681>
**Scholarly article side (proxied to wdqs1023.eqiad.wmnet)**
F41650680:
dr0ptp4kt added a comment.
After an update to the script (PS6) and a fresh run of the same commands, new
files have been `hdfs-rsync`'d to `stat1006:~dr0ptp4kt/gzips` in anticipation
of a file transfer over to the WDQS graph split test servers.
Here's a very small sample of what
dr0ptp4kt added a subscriber: RKemper.
dr0ptp4kt added a comment.
I ran the current version of the code as follows:
spark3-submit --master yarn --driver-memory 16G --executor-memory 12G
--executor-cores 4 --conf spark.driver.cores=2 --conf
spark.executor.memoryOverhead=4g --conf
dr0ptp4kt added a comment.
Not using this right now, but here's roughly how one might go about generating
more expanded Turtle statements without reverse-mapping prefixes: F41561068
<https://phabricator.wikimedia.org/F41561068>
TASK DETAIL
https://phabricator.wikimedia.org/T350106
dr0ptp4kt added a subscriber: EBernhardson.
dr0ptp4kt added a comment.
Adding a note so I don't forget: advice from @BTullis is to avoid NFS if
possible, and advice from @JAllemandou is to consider use of `hdfs-rsync`
(after our call I sought this out and found these:
https
dr0ptp4kt claimed this task.
TASK DETAIL
https://phabricator.wikimedia.org/T350106
To: dr0ptp4kt
Cc: Aklapper, BTullis, bking, dr0ptp4kt, JAllemandou, dcausse,
Danny_Benjafield_WMDE, Astuthiodit_1
dr0ptp4kt added a comment.
The job completed. The counts on this productionized job match up with the
prior one run in my namespace. Following are some Hive queries in case they're
needed later. Below that is a really small sample of the resultant data in
tabular format for each partition
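(The queries themselves are cut off above; as a hypothetical example of the kind of per-partition count involved, assuming the spark3-sql wrapper on the stat hosts and using the namespace table that appears later in this digest, with made-up column and snapshot values:)
spark3-sql -e "
  SELECT scope, COUNT(*) AS triples          -- 'scope' and the snapshot value are hypothetical
  FROM dr0ptp4kt.wikibase_rdf_with_split
  WHERE snapshot = '2023-10-02'
  GROUP BY scope
  ORDER BY scope;
"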
dr0ptp4kt closed this task as "Resolved".
dr0ptp4kt triaged this task as "High" priority.
TASK DETAIL
https://phabricator.wikimedia.org/T347989
To: dr0ptp4kt
Cc: EBernhardson, bking
dr0ptp4kt closed subtask T347989: Adapt rdf-spark-tools to split the wikidata
graph based on a set of rules as Resolved.
TASK DETAIL
https://phabricator.wikimedia.org/T337013
To: dr0ptp4kt
Cc: dr0ptp4kt
dr0ptp4kt added a subscriber: EBernhardson.
dr0ptp4kt added a comment.
Spark patch merged, new Jenkins build of the rdf JAR done, Airflow patch
merged. This is deployed to Search's Airflow instance and the job is running.
Thank you, @dcausse and @EBernhardson.
Here's the location
dr0ptp4kt moved this task from In Progress to Needs review on the
Discovery-Search (Current work) board.
dr0ptp4kt added a comment.
Here's what I saw after re-running. So, we should be good with the latest
patchset that goes without distinct() on the final graphs.
Without distinct
dr0ptp4kt added a comment.
Update: it seems to be working. I'd say this is maybe 75% complete.
It takes about 1h40m to run and generate the two different partitions.
WIP/Draft patches posted at
https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/969229 and ^ . They
require
dr0ptp4kt added a comment.
I also see
https://grafana.wikimedia.org/d/00264/wikidata-dump-downloads?orgId=1=5m=now-2y=now
which I noticed from some tickets involving @Addshore (cf. T280678: Crunch and
delete many old dumps logs <https://phabricator.wikimedia.org/T280678> and
f
dr0ptp4kt added a comment.
^ Update.
TASK DETAIL
https://phabricator.wikimedia.org/T347605
To: bking, dr0ptp4kt
Cc: jochemla, Addshore, dr0ptp4kt, Aklapper, bking, Danny_Benjafield_WMDE,
Astuthiodit_1
dr0ptp4kt added a comment.
Looking at yesterday's downloads with a rudimentary grep, we're not far from
1K downloads, and that's just for the //latest-all// ones. That also doesn't
consider mirrors.
stat1007:~$ zgrep wikidatawiki
/srv/log/webrequest/archive/dumps.wikimedia.org
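(The log path above is truncated; a sketch of the kind of rudimentary count meant, with the specific archive file name being a guess:)
# Count yesterday's requests for the latest-all dumps (file name below is illustrative):
zgrep wikidatawiki /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20231004.gz \
  | grep latest-all | wc -l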
dr0ptp4kt added a comment.
It took about 26min 24s to write `S_direct_triples` (7_293_925_470 rows) in
basic Parquet. It's not all the rows (not even for its own partition, as that
will include Value and Reference triples as well), but this means it ought to
be possible for the job to write
dr0ptp4kt added a comment.
TL;DR: this is about 45% done.
This week I was working to address non-performant, often hanging or crashing,
Spark runs. Last night I managed to get this running better, producing a
reduction (the equivalent of `val_triples_only_used_by_sas` from
https
dr0ptp4kt added a comment.
Good question - I meant the contrast between the .ttl.gz dumps and
everything that goes into munging and importing them (in aggregate across all
downloaders of those files) versus the same if this were done with the .jnl,
where they don't have to munge
dr0ptp4kt added a comment.
Personalized dev environment on the analytics cluster with an Airflow setup
(stat1006) - I was able to execute the job, slightly hacked up to target a
specific date and not keep running regularly (it eats lots of disk), to produce
`dr0ptp4kt.wikibase_rdf_with_split` using my Kerberos
dr0ptp4kt added a comment.
> I think the amount of time taken to decompress the JNL file should also be
taken into consideration on varying hardware if compression is being considered.
Closing the loop, posted my experience at T347605#9229608
<https://phabricator.wikimedia.org/T
dr0ptp4kt added a comment.
@bking just wanted to express my gratitude for the support on this ticket and
its friends T344905: Publish WDQS JNL files to dumps.wikimedia.org
<https://phabricator.wikimedia.org/T344905> and T347647: 2023-09-18
latest-all.ttl.gz WDQS dump `Fatal error m
dr0ptp4kt added a comment.
Addressing @Addshore's comment in T344905#9210122
<https://phabricator.wikimedia.org/T344905#9210122>...
> I think the amount of time taken to decompress the JNL file should also be
taken into consideration on varying hardware if compression
dr0ptp4kt added a comment.
Drawing from your inspiration, I downloaded with `wget` overnight and the
`sha1sum` now matches that from `wdqs1016`. Deflating now; will update with
results.
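(A minimal sketch of those steps, with the download URL a placeholder:)
wget -c https://dumps.wikimedia.org/path/to/wikidata.jnl.zst   # -c resumes a partial download
sha1sum wikidata.jnl.zst                                       # compare against the hash from wdqs1016
zstd -d wikidata.jnl.zst -o wikidata.jnl                       # the "deflating" step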
TASK DETAIL
https://phabricator.wikimedia.org/T347605
dr0ptp4kt closed this task as "Resolved".
dr0ptp4kt added a comment.
I'm going to close this for now, given that the later dump munged okay and
there seems to be an underlying issue somewhere, probably related to file
transfer. The ``-- --skolemize`` will be a thing to consider for
dr0ptp4kt added a comment.
I did manage to run a `sha1sum` on the older dump where the import had failed.
/mnt/w$ time sha1sum latest-all.ttl.gz
dedad5a589b3a3661a1f9ebb7f1a6bcbce1b4ef2 latest-all.ttl.gz
real    28m47.000s
user    3m21.104s
sys     0m46.825s
dr0ptp4kt added a comment.
Here's the `sha1sum` for the latest file I had downloaded:
/mnt/x$ time sha1sum wikidata.jnl.zst
62327feb2c6ad5b352b5abfe9f0a4d3ccbeebfab wikidata.jnl.zst
real    77m16.215s
user    8m39.726s
sys     2m42.932s
dr0ptp4kt claimed this task.
TASK DETAIL
https://phabricator.wikimedia.org/T347989
To: dr0ptp4kt
Cc: dr0ptp4kt, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1,
AWesterinen, karapayneWMDE
dr0ptp4kt added a comment.
For me, the first 300 GB of the file went really, really fast. But `axel` was
dropping connections, similar to when I had downloaded the large 1 TB file. So
this download took about 5 hours. I'm pretty sure it could be done in 1-3
hours, though, if everything were
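(For anyone repeating this, the usual axel form is roughly the following; the connection count and URL are illustrative:)
# -n sets the number of parallel connections, -o the local output file.
axel -n 8 -o wikidata.jnl.zst https://dumps.wikimedia.org/path/to/wikidata.jnl.zst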
dr0ptp4kt added a comment.
The addshore .jnl (August file) does launch nicely with `./runBlazegraph.sh`.
TASK DETAIL
https://phabricator.wikimedia.org/T347647
To: dr0ptp4kt
Cc: Aklapper, Stang
dr0ptp4kt added a comment.
The addshore .jnl (August file) download completed, using the Linux
tool axel. Working from memory as I checked on the download on my 1 Gbps
connection: the first 800 or so GB downloaded over the first 3-4 hours, then (as
some Cloudflare connections seemed
dr0ptp4kt added a comment.
Update - the newer dump munged without any problems.
TASK DETAIL
https://phabricator.wikimedia.org/T347647
To: dr0ptp4kt
Cc: Aklapper, Stang, dr0ptp4kt, bking