Thanks for the info, Nauroth. I will try DistCh. Sorry for the late response.
For a chmod -R call on one directory, I see that there are many calls to the
NameNode, so I assume the recursion is done by the client. Wouldn't it be
better for the NameNode to do the recursion while holding a re-entrant lock,
instead of recursing over the network and taking the lock for every call?

On Thu, Jun 16, 2016 at 11:24 AM, Chris Nauroth <[email protected]> wrote:

> Hello Ravi,
>
> You might consider using DistCh. In the same way that DistCp is a
> distributed copy implemented as a MapReduce job, DistCh is a MapReduce job
> that distributes the work of chmod/chown.
>
> DistCh will become easier to access through convenient shell commands in
> Apache Hadoop 3. In version 2.6.0, it's undocumented and hard to find, but
> it's still there. It's inside the hadoop-extras.jar. Here is an example
> invocation:
>
> hadoop jar share/hadoop/tools/lib/hadoop-extras-*.jar
> org.apache.hadoop.tools.DistCh
>
> It might take some fiddling with the classpath to get this right. If so,
> then I recommend looking at how the shell scripts in trunk set up the
> classpath.
>
> https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-extras/src/main/shellprofile.d/hadoop-extras.sh
>
> As you pointed out, this would generate higher NameNode traffic compared
> to your typical baseline load. To mitigate this, I recommend that you
> start with a test run in a non-production environment to see how it reacts.
>
> --Chris Nauroth
>
> From: ravi teja <[email protected]>
> Date: Wednesday, June 15, 2016 at 8:33 PM
> To: "[email protected]" <[email protected]>
> Subject: Bulk chmod,chown operations on HDFS
>
> Hi Community,
>
> As part of the new authorisation changes, we need to change the
> permissions and owners of many files in hdfs (2.6.0) with chmod and chown.
>
> To do this we need to stop the processing on the directories to avoid
> inconsistencies in permissions, hence we need to take a downtime for those
> specific pipelines operating on these folders.
>
> The total number of files/directories to be operated upon is around 10
> Million.
> A chmod recursive (chmod -R) on 160K objects, has taken around 15 minutes.
>
> At this rate it will take a long time to complete the operation and the
> downtime would be couple of hours.
>
> Mapreduce program is one option, but chmod,chown being a heavy
> operations, will slow down the cluster for other users, if done at this
> scale.
>
> Are there any options to do a bulk permissions changes chmod,chown to
> avoid these issues?
> If not are there any alternative approaches to carry the same operation at
> this scale something like admin backdoor to fsimage?
>
> Thanks,
> Ravi Teja
>
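
For reference, the figures in the original message (160K objects in about 15 minutes, about 10 million objects in total) can be extrapolated with a small sketch. It assumes the single-client chmod -R rate stays constant as the object count grows, which nothing in this thread confirms. The function name estimate_minutes is only illustrative.

```python
def estimate_minutes(total_objects, sample_objects, sample_minutes):
    """Linearly extrapolate a measured chmod -R run to a larger object count.

    Assumption: the per-object cost is constant, i.e. the NameNode is not
    slowed further by other load or by the larger namespace.
    """
    return total_objects * sample_minutes / sample_objects


# Figures taken from the original message in this thread.
minutes = estimate_minutes(10_000_000, 160_000, 15)
print(f"{minutes:.1f} minutes, about {minutes / 60:.1f} hours")
```

Under that linear assumption, a single sequential chmod -R over all 10 million objects would take noticeably longer than a couple of hours, so it is worth re-measuring on a larger sample before planning the downtime.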
