Hi Andy, Inline.
On Wed, Oct 24, 2012 at 7:53 PM, Kartashov, Andy <[email protected]> wrote:
> Gents,
>
> Two questions:
>
> 1. Say you have 5 folders with input data
> (fold1,fold2,fold3,....,fold5) in your HDFS in a pseudo-dist mode cluster.
>
> You will write your MR job to access your files by listing them in:
>
> FileInputFormat.addInputPaths(job, "fold1, fold2, fold3…,fold5");
>
> Q: Is there a way to move the above folders to the parent folder say,
> "the_folder", so that the dir struct will be the_folder/fold1,
> the_folder/fold2... Will it be possible to access your files with something
> like: FileInputFormat.addInputPaths(job, "the_fold1/*"); or similar?
>
> I am asking in case your input folders list grows too long. How to curb
> that?

Yes, the FileInputFormat.addInputPath(…) API [1] supports glob patterns,
so you can pass it a Path object of "the_folder/*/*" or so.

[1] - http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html#addInputPath(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path)

> 2. Hypothetically speaking in fully-dist mode cluster your folders
> with Data are located as follows: Node1: (fold1,fold2,fold3) and
> Node2:(fold4, fold5)
>
> Q: Do we change below command or will NN and JT take care of locating
> those files?
>
> FileInputFormat.addInputPaths(job, "fold1, fold2, fold3…,fold5");

No change is needed. HDFS paths are cluster-wide rather than per-node, and
the JT and NN take care of data locality for you. You need not worry about
that (manually) at all.

> 2a. Using Data balancer which splits input/moves Data across
> additional DNs indicated in conf/slaves, is it possible to run "hdfs dfs
> -ls -r" command on the slave node that runs DN on a separate machine? I
> have

Yes, you can run regular HDFS client operations (such as ls, cat, job
submission) from any machine, regardless of whether that machine is a slave
node, a master node, or neither. Client access is not tied to conf/slaves
or to which daemons a machine runs.
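To make the glob answer to question 1 a bit more concrete, here is a rough local sketch of which paths a pattern selects. Note that it uses java.nio's PathMatcher as a stand-in, not Hadoop's own glob code, and that the class name GlobDemo and the part-00000 file names are made up for the example. For a plain `*` the two should behave alike as far as I know: `*` stays within a single path component.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Hadoop.
class GlobDemo {
    // Returns the candidate paths that match the glob, in input order.
    static List<String> select(String glob, List<String> candidates) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> hits = new ArrayList<>();
        for (String candidate : candidates) {
            if (matcher.matches(Paths.get(candidate))) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed layout: two input folders under the_folder, one file each.
        List<String> paths = Arrays.asList(
            "the_folder/fold1",
            "the_folder/fold2",
            "the_folder/fold1/part-00000",
            "the_folder/fold2/part-00000",
            "other/fold9");

        // One level down: selects the input folders themselves.
        System.out.println(select("the_folder/*", paths));
        // Two levels down: selects the files inside those folders.
        System.out.println(select("the_folder/*/*", paths));
    }
}
```

In the job itself this would correspond to a single call along the lines of FileInputFormat.addInputPath(job, new Path("the_folder/*/*")), so the input list no longer grows with the number of folders.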
> Cheers,
>
> AK

--
Harsh J
