dr0ptp4kt added a subscriber: RKemper.
dr0ptp4kt added a comment.
I ran the current version of the code as follows:
spark3-submit --master yarn \
  --driver-memory 16G \
  --executor-memory 12G \
  --executor-cores 4 \
  --conf spark.driver.cores=2 \
  --conf spark.executor.memoryOverhead=4g \
  --conf spark.sql.shuffle.partitions=512 \
  --conf spark.dynamicAllocation.maxExecutors=128 \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  --conf spark.yarn.maxAppAttempts=1 \
  --class org.wikidata.query.rdf.spark.transform.structureddata.dumps.NTripleGenerator \
  --name wikibase-rdf-statements-spark \
  ~dr0ptp4kt/rdf-spark-tools-0.3.138-SNAPSHOT-jar-with-dependencies.jar \
  --input-table-partition-spec discovery.wikibase_rdf_scholarly_split/snapshot=20231016/wiki=wikidata/scope=wikidata_main \
  --output-hdfs-path hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main \
  --num-partitions 1024
spark3-submit --master yarn \
  --driver-memory 16G \
  --executor-memory 12G \
  --executor-cores 4 \
  --conf spark.driver.cores=2 \
  --conf spark.executor.memoryOverhead=4g \
  --conf spark.sql.shuffle.partitions=512 \
  --conf spark.dynamicAllocation.maxExecutors=128 \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  --conf spark.yarn.maxAppAttempts=1 \
  --class org.wikidata.query.rdf.spark.transform.structureddata.dumps.NTripleGenerator \
  --name wikibase-rdf-statements-spark \
  ~dr0ptp4kt/rdf-spark-tools-0.3.138-SNAPSHOT-jar-with-dependencies.jar \
  --input-table-partition-spec discovery.wikibase_rdf_scholarly_split/snapshot=20231016/wiki=wikidata/scope=scholarly_articles \
  --output-hdfs-path hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_schol \
  --num-partitions 1024
I then updated the group ownership of the output directories:

hdfs dfs -chgrp -R analytics-search-users hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main
hdfs dfs -chgrp -R analytics-search-users hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_schol
From stat1006 it is possible to use the already installed `hdfs-rsync` (a script
fronting a Java utility) to copy the produced files, like this:

hdfs-rsync -r hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_schol/ file:/destination/to/nt_wd_schol_gzips/
hdfs-rsync -r hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main/ file:/destination/to/nt_wd_main_gzips/
Note: each directory contains 1,024 files of roughly 100 MB each, plus or minus a
few MB. The Spark routine randomly samples the data before sorting it into
partitions, and although every partition has data, there is mild skew, so the
files do not all hold exactly the same number of records. A rough sketch of that
partitioning step follows.
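The skew comes from how Spark's range partitioner works: it samples the input to
estimate partition boundaries and then sorts rows into those ranges. The snippet
below is only a minimal Scala sketch of that step, assuming a hypothetical
`statement` column holding pre-serialized N-Triples lines; the actual
NTripleGenerator implementation may differ.

// Minimal sketch only; column names are assumptions, not the real job code.
import org.apache.spark.sql.SparkSession

object NTripleRepartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("nt-repartition-sketch").getOrCreate()
    import spark.implicits._

    // Read one scope of one snapshot, as in the --input-table-partition-spec above.
    val statements = spark.read.table("discovery.wikibase_rdf_scholarly_split")
      .where($"snapshot" === "20231016" && $"wiki" === "wikidata" && $"scope" === "wikidata_main")
      .selectExpr("statement AS line") // hypothetical column with a full N-Triples statement

    statements
      // repartitionByRange samples rows to estimate range boundaries and then sorts
      // rows into 1,024 partitions; sampling error leaves mild skew between files.
      .repartitionByRange(1024, $"line")
      .write
      .option("compression", "gzip")
      .text("hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main")
  }
}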
@bking / @RKemper / @dcausse and I will discuss this more this week.
TASK DETAIL
https://phabricator.wikimedia.org/T350106