EBernhardson added a comment.
In T342416#9101474 <https://phabricator.wikimedia.org/T342416#9101474>, @JAllemandou wrote:

> In T342416#9091146 <https://phabricator.wikimedia.org/T342416#9091146>, @EBernhardson wrote:
>
>> Similarly we have other jobs that still run today and emit world readable dumps without explicitly setting the umask, what is causing the difference?
>>
>> drwxrwxr-x /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230716
>> drwxrwxr-x /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230723
>> drwxrwxr-x /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230730
>> drwxrwxr-x /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230806
>
> The guess I have about those would be that they are still generated by a Hive job. Hive and spark behave differently in regard to permissions when generating files. Spark uses the configured umask, while hive reproduces the parent-dir pattern. I'd be interested to be sure if my guess is correct :)

These are both generated by Spark. The RDF is imported by a Scala application while the cirrus dump is imported by PySpark, but they should both be using the same underlying implementation. Both applications use `df.write.insertInto(table_name)` to instruct Spark to do the actual output, so I'm a bit surprised they end up generating different sets of permissions.

I suppose it's not super important why the cirrus dump is world readable; it's fine for it to be readable. It just hints to me that there is something I don't understand about how HDFS, Spark, and permissions interact here.
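For reference, here is a minimal sketch of how a umask maps to permission strings like the ones in the listings above. It is plain Python and does not touch HDFS or Spark. The helper names `apply_umask` and `describe_dir` are made up for illustration, and the base mode of 0o777 for new directories follows the usual POSIX convention; that our jobs request exactly that mode is an assumption.

```python
import stat


def apply_umask(requested_mode: int, umask: int) -> int:
    """Clear every permission bit that is set in the umask."""
    return requested_mode & ~umask & 0o777


def describe_dir(mode: int) -> str:
    """Render permission bits the way a directory listing shows them."""
    return stat.filemode(stat.S_IFDIR | mode)


# Assumed base mode for a newly created directory (hypothetical for our jobs).
BASE_DIR_MODE = 0o777

for umask in (0o002, 0o022, 0o027):
    mode = apply_umask(BASE_DIR_MODE, umask)
    print(f"umask {umask:03o} -> {describe_dir(mode)}")
```

Under these assumptions, `drwxrwxr-x` is what a umask of 002 would produce, but a directory that copies group-writable permissions from its parent would look identical, so the listing alone cannot tell the two mechanisms apart. On the Hadoop client side the relevant setting is `fs.permissions.umask-mode`; comparing its effective value in the two jobs might be a reasonable next step.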
TASK DETAIL
  https://phabricator.wikimedia.org/T342416
