dr0ptp4kt added a subscriber: RKemper.
dr0ptp4kt added a comment.
I ran the current version of the code as follows:
spark3-submit --master yarn \
  --driver-memory 16G \
  --executor-memory 12G \
  --executor-cores 4 \
  --conf spark.driver.cores=2 \
  --conf spark.executor.memoryOverhead=4g \
  --conf spark.sql.shuffle.partitions=512 \
  --conf spark.dynamicAllocation.maxExecutors=128 \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  --conf spark.yarn.maxAppAttempts=1 \
  --class org.wikidata.query.rdf.spark.transform.structureddata.dumps.NTripleGenerator \
  --name wikibase-rdf-statements-spark \
  ~dr0ptp4kt/rdf-spark-tools-0.3.138-SNAPSHOT-jar-with-dependencies.jar \
  --input-table-partition-spec discovery.wikibase_rdf_scholarly_split/snapshot=20231016/wiki=wikidata/scope=wikidata_main \
  --output-hdfs-path hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main \
  --num-partitions 1024
spark3-submit --master yarn \
  --driver-memory 16G \
  --executor-memory 12G \
  --executor-cores 4 \
  --conf spark.driver.cores=2 \
  --conf spark.executor.memoryOverhead=4g \
  --conf spark.sql.shuffle.partitions=512 \
  --conf spark.dynamicAllocation.maxExecutors=128 \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  --conf spark.yarn.maxAppAttempts=1 \
  --class org.wikidata.query.rdf.spark.transform.structureddata.dumps.NTripleGenerator \
  --name wikibase-rdf-statements-spark \
  ~dr0ptp4kt/rdf-spark-tools-0.3.138-SNAPSHOT-jar-with-dependencies.jar \
  --input-table-partition-spec discovery.wikibase_rdf_scholarly_split/snapshot=20231016/wiki=wikidata/scope=scholarly_articles \
  --output-hdfs-path hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_schol \
  --num-partitions 1024
I then updated the group ownership of the output directories:

hdfs dfs -chgrp -R analytics-search-users hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main
hdfs dfs -chgrp -R analytics-search-users hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_schol
From stat1006 it is possible to use the already installed `hdfs-rsync` (a script
fronting a Java utility) to copy the produced files, like this:

hdfs-rsync -r hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_schol/ file:/destination/to/nt_wd_schol_gzips/
hdfs-rsync -r hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main/ file:/destination/to/nt_wd_main_gzips/
Note: each directory contains 1,024 files of roughly 100 MB each, plus or minus a
few MB. The Spark routine randomly samples the data before sorting it into
partitions, and although every partition has data, there is mild skew, so the
files do not all hold exactly the same number of records. A rough sketch of that
partitioning step follows.
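The skew comes from how Spark's range partitioner works: it samples the input to
estimate partition boundaries and then sorts rows into those ranges. The snippet
below is only a minimal Scala sketch of that step, assuming a hypothetical
`statement` column holding pre-serialized N-Triples lines; the actual
NTripleGenerator implementation may differ.

// Minimal sketch only; column names are assumptions, not the real job code.
import org.apache.spark.sql.SparkSession

object NTripleRepartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("nt-repartition-sketch").getOrCreate()
    import spark.implicits._

    // Read one scope of one snapshot, as in the --input-table-partition-spec above.
    val statements = spark.read.table("discovery.wikibase_rdf_scholarly_split")
      .where($"snapshot" === "20231016" && $"wiki" === "wikidata" && $"scope" === "wikidata_main")
      .selectExpr("statement AS line") // hypothetical column with a full N-Triples statement

    statements
      // repartitionByRange samples rows to estimate range boundaries and then sorts
      // rows into 1,024 partitions; sampling error leaves mild skew between files.
      .repartitionByRange(1024, $"line")
      .write
      .option("compression", "gzip")
      .text("hdfs://analytics-hadoop/user/dr0ptp4kt/nt_wd_main")
  }
}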
@bking / @RKemper / @dcausse and I will discuss this more this week.
TASK DETAIL
https://phabricator.wikimedia.org/T350106