dr0ptp4kt added a comment.
A run is in progress for 78K+ queries drawn from a set of 100,000 random queries. It should finish within the next 10 hours.
scala> val full_random = spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/full_random_classified.parquet")
scala> val wikidata_random = spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/wikidata_random_classified.parquet")
scala> full_random.count
res0: Long = 100000
scala> wikidata_random.count
res6: Long = 100000
scala> val joined11 = wikidata_random.as("w").join(full_random.as("f"))
         .where("w.id = f.id and w.success = true and w.success = f.success and w.resultSize = f.resultSize and w.reorderedHash = f.reorderedHash")
         .select(concat(col("w.query"), lit("\n### BENCH DELIMITER ###")))
         .distinct
         .sample(withReplacement = false, fraction = 1.0, seed = 42)
scala> joined11.count
res0: Long = 78862
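For intuition, the join above keeps only query ids on which the two graphs agreed: both runs succeeded, returned the same number of results, and produced the same order-insensitive result hash. A minimal sketch of that predicate over plain Scala collections (toy data; the field names mirror the parquet schema, but the values are made up):

```scala
// Toy stand-in for one classified query run from the parquet files.
case class Run(id: Long, query: String, success: Boolean, resultSize: Long, reorderedHash: Long)

val wikidata = Seq(
  Run(1, "SELECT ...", success = true,  resultSize = 10, reorderedHash = 111L),
  Run(2, "ASK ...",    success = true,  resultSize = 1,  reorderedHash = 222L),
  Run(3, "SELECT ...", success = false, resultSize = 0,  reorderedHash = 0L)
)
val full = Seq(
  Run(1, "SELECT ...", success = true,  resultSize = 10, reorderedHash = 111L),
  Run(2, "ASK ...",    success = true,  resultSize = 2,  reorderedHash = 333L), // result size differs
  Run(3, "SELECT ...", success = true,  resultSize = 5,  reorderedHash = 444L)  // only one side succeeded
)

// Equivalent of the SQL predicate: same id, both successful, same size and hash.
val fullById = full.map(r => r.id -> r).toMap
val agreeing = wikidata.filter { w =>
  fullById.get(w.id).exists { f =>
    w.success && f.success && w.resultSize == f.resultSize && w.reorderedHash == f.reorderedHash
  }
}
println(agreeing.map(_.id))  // prints List(1)
```

Only id 1 survives: id 2 disagrees on result size and hash, and id 3 failed on one side.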
scala> joined11.repartition(1).write.option("compression", "none").text("queries_for_performance_2024_01_31.txt")
scala> :quit
$ hdfs dfs -copyToLocal \
    hdfs://analytics-hadoop/user/dr0ptp4kt/queries_for_performance_2024_01_31.txt/part-00000-29c4e72d-800d-4148-b804-8e428ee71e9e-c000.txt \
    ./queries_for_performance_file_renamed_randomized_2024_01_31.txt
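As a sanity check (my suggestion, not part of the original session), the exported file's query count can be recovered by counting delimiter occurrences, which should equal the `joined11.count` above (78862 for this run). A toy illustration using an in-memory string in place of the real file:

```scala
// The export writes each query followed by a newline and the delimiter line.
val delimiter = "\n### BENCH DELIMITER ###"
val exported =
  "SELECT * WHERE { ?s ?p ?o } LIMIT 1" + delimiter + "\n" +
  "ASK { ?s ?p ?o }" + delimiter + "\n"

// Splitting on the delimiter (and dropping blank remainders) recovers the queries.
val queries = exported
  .split(java.util.regex.Pattern.quote(delimiter))
  .map(_.trim)
  .filter(_.nonEmpty)
println(queries.length)  // prints 2
```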
$ bash start-iguana.sh wdqs-split-test-randomized-2024-01-31.yml
`start-iguana.sh` previously ran from `stat1006`, but this time around it is
running from `stat1008` in order to use more RAM for the larger query mix.
TASK DETAIL
https://phabricator.wikimedia.org/T355037