1. Create a temp dir on HDFS, say "/tmp".
2. Write a script that creates, in the temp dir, one file per tar file. Each file contains a single line: <absolute path of the tar file>.
3. Write a Spark application along these lines:

   val rdd = sc.textFile(<HDFS path of the temp dir>)
   rdd.foreach { line =>
     // construct an untar command using the path information in "line"
     // and launch the command
   }

   (Note that an action such as foreach is needed here; map alone is lazy,
   so the untar commands would never actually run.)
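A minimal sketch of step 3, assuming the path files live under a hypothetical "/tmp/tar-paths" dir and that the `hadoop` and `tar` binaries are available on every executor (the object name and paths are illustrative, not from the original):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.sys.process._

object UntarJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("untar-hdfs"))

    // Each file under /tmp/tar-paths holds one line:
    // the absolute HDFS path of a tar file.
    val tarPaths = sc.textFile("/tmp/tar-paths")

    tarPaths.foreach { tarPath =>
      // Runs on the executor: pull the tar locally, then extract it.
      // Assumes `hadoop` and `tar` exist on every worker node.
      val localName = tarPath.split("/").last
      Seq("hadoop", "fs", "-get", tarPath, localName).!
      Seq("tar", "-xf", localName).!
      // From here the extracted files could be renamed (e.g. tar1_f1.txt)
      // and pushed back to HDFS with `hadoop fs -put`.
    }

    sc.stop()
  }
}
```

Because each line is an independent path, the extractions are distributed across the cluster for free; the trade-off is that the actual work happens outside Spark, in shelled-out processes on the workers.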
> On May 19, 2016, at 14:42, ayan guha <guha.a...@gmail.com> wrote:
>
> Hi
>
> I have a few tar files in HDFS in a single folder. Each tar file contains
> multiple files:
>
> tar1:
>   - f1.txt
>   - f2.txt
> tar2:
>   - f1.txt
>   - f2.txt
>
> (each tar file has exactly the same number of files, with the same names)
>
> I am trying to find a way (Spark or Pig) to extract them into their own
> folders:
>
> f1:
>   - tar1_f1.txt
>   - tar2_f1.txt
> f2:
>   - tar1_f2.txt
>   - tar2_f2.txt
>
> Any help?
>
> --
> Best Regards,
> Ayan Guha