dr0ptp4kt added a comment.
Now, the screenshot from the queries run in randomized order. I'll run this
one more time to check that the output is comparable. The results were
produced with the following commands. The latest output file has been moved
to `result.nt.003`.
scala> val joined6 = wikidata.as("w").join(full.as("f")).
     |   where("w.id = f.id and w.success = true and w.success = f.success and w.resultSize = f.resultSize and w.reorderedHash = f.reorderedHash").
     |   select(concat(col("w.query"), lit("\n### BENCH DELIMITER ###"))).
     |   distinct.sample(withReplacement=false, fraction=1.0, seed=42)
scala> joined6.count // matches joined5.count
scala> joined6.repartition(1).write.option("compression", "none").text("queries_for_performance_randomized_2024_01_26.txt")
scala> :quit
$ hdfs dfs -copyToLocal \
    hdfs://analytics-hadoop/user/dr0ptp4kt/queries_for_performance_randomized_2024_01_26.txt/part-00000-131df78f-da7a-4ffc-aad4-9874342165ca-c000.txt \
    ./queries_for_performance_randomized.txt
$ sha1sum queries_for_performance.txt queries_for_performance_randomized.txt
$ # they're different
$ diff queries_for_performance.txt queries_for_performance_randomized.txt | wc -l
$ # they're very different
$ cp wdqs-split-test.yml wdqs-split-test-randomized.yml
$ # changed pointers to query file to be queries_for_performance_randomized.txt
$ bash start-iguana.sh wdqs-split-test-randomized.yml
$ mv result.nt result.nt.003
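
The `sha1sum` and `diff` checks above show that the two query files differ, but not that they hold the same queries. A quick sanity check for that could look like the following. This is a minimal sketch and was not part of the run above: the function names `same_lines_any_order` and `count_queries` are made up for illustration, and it assumes the delimiter `### BENCH DELIMITER ###` sits on a line of its own. Comparing sorted lines is a necessary condition for "same queries, different order", not a strictly sufficient one.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the original run).

# Succeeds when two files contain the same multiset of lines,
# regardless of the order in which the lines appear.
same_lines_any_order() {
  local a b
  a=$(LC_ALL=C sort -- "$1" | sha1sum | cut -d' ' -f1)
  b=$(LC_ALL=C sort -- "$2" | sha1sum | cut -d' ' -f1)
  [ "$a" = "$b" ]
}

# Prints the number of delimiter lines in a file, i.e. the number of
# queries, assuming each query is followed by one delimiter line.
count_queries() {
  grep -c -F -x '### BENCH DELIMITER ###' -- "$1"
}
```

With these loaded, `same_lines_any_order queries_for_performance.txt queries_for_performance_randomized.txt && echo "same lines"` would confirm that only the ordering changed, and `count_queries` on each file should print the same number.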
TASK DETAIL
https://phabricator.wikimedia.org/T355037